There's an FAQ posted to comp.databases.oracle newsgroup every month or
so. It is also available via anon ftp from rtfm.mit.edu, the home of all FAQs.

---

That seems very low, we set ours to 128MB so that the DBAs can make their SGAs bigger to buffer more data. It all depends on the type of db you're running, of course.

---

You can set SHMMAX to anything up to 2GB, it does not have any adverse effect on performance. The general rule of thumb is that it should be greater than your SGA and sensibly about 75% of your physical RAM. We have Oracle 7.3.3 on 2.5.1 on an 8-way SPARC1000E, 756mb RAM. Here are our /etc/system entries:

*** Set Shared Memory / Semaphores for Oracle
set semsys:seminfo_semmni=200
set semsys:seminfo_semmns=200
set semsys:seminfo_semmsl=20
set shmsys:shminfo_shmmax=49908864
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=512
set shmsys:shminfo_shmseg=10
forceload: sys/msgsys
forceload: sys/shmsys
forceload: sys/semsys

HOWEVER, see attached file for what the experts say.

--------------------------------- Cut Here ---------------------------------

Optimizing and Measuring the Solaris Kernel For Large Oracle Servers.
by Mike Jaffee, Sun Microsystems

The first part of the paper will discuss the basics of Solaris Internals that are relevant to the Oracle DBA along with tips to common technical questions and relevant header files. The second part is quoted tuning information taken from Sun Experts. The final part is a discussion of kernel memory allocation, how to measure it, and some things that can be done to prevent starvation.

Solaris Internals

Sparc has two rings of execution. The inner ring is for kernel functions and the outer ring is for user process functions. The process address space is virtual, and normally only part of a process is in physical memory. The kernel stores the contents of the process address space in physical memory, on-disk files, and specially reserved swap areas. Over time the kernel shuffles pages of the processes between physical memory and disk.
Each process has registers that are stored in the kernel and are placed in the hardware registers at run time. A process must block if it is waiting for a resource and allow another process to run. The kernel allows each process a brief period of time, usually 10 milliseconds, to run before performing a context switch. (Vahalia p.20-25)

On startup once the kernel is loaded, user processes can request system services from the kernel through the system call interface. If the process misbehaves by dividing by zero or overflowing its stack, a hardware exception occurs, and the kernel intervenes, usually aborting the process. Interrupts come from peripheral devices usually indicating a status change or I/O completion. Two important processes that manage memory are the swapper and pagedaemon. (Vahalia p.22-25)

Each process has a virtual memory address space (VMA) that is translated to physical memory addresses by page tables. This mapping is done by the chip's MMU. (Tip - System panics can be either hardware or software related. The MMU registers give helpful hints on what actually caused the panic.)

In addition to kernel and user mode, there is kernel and user space. This refers to regions in the virtual memory address space of the process. There is only one kernel and many processes and hence every process must map in a single kernel address space. The kernel portion of the VMA maintains global data structures and some per process objects. These can only be accessed by the kernel when the chip is running in kernel mode (ring 0). Since the kernel is shared by all processes, kernel space must be protected from user-mode access. This is done by requiring the processes to use the system call interface. This requires the chip to go into kernel mode, transfer program control to the kernel, have the kernel execute system code instructions, then switch back to user mode and user control of the process.
(Vahalia p.22-23)

System Services

Oracle uses many Solaris system services such as file and record locking, inter process communications, virtual memory, and process scheduling. Common system calls are open, read, write, fcntl, kill, priocntl, plock, memcntl, sync. Common signals are:

SIGSEGV - usually means user stack overflow
SIGBUS  - out of the process address space
SIGTERM - user has "hung up" without exiting gracefully
SIGUSR1 - defined signal for asynchronous events
SIGKILL - kill process immediately, no exceptions

Oracle uses file and record locking by setting read and write locks on portions of a file. Any process can read a file that is locked but only the owner of the lock can update the file. A write lock is sometimes called an exclusive lock and a read lock is sometimes called a shared lock. Process scheduling is usually managed very well by the kernel, however a slow job can be sped up by the priocntl system call. (System Services Guide p.5-25)

Jim Skeen of Sunsoft - "Oracle gets locked-down memory as a consequence of using intimate shared memory (ISM), not through plock. It controls sharing inside shared memory through latches, not memcntl or plock." He also cautions against changing the priority of the Oracle processes: "This is something we in DBE actually strongly discourage.
Only the most daring and knowledgeable DBA's should attempt this. The problem is that system threads can get starved if Oracle processes are not "well behaved" when running in real time class. Oracle processes may easily hog a cpu for extended periods of time (time being measured in Unix quantums). We in DBE have experimented with changing the dispatch table in useful/clever ways, to minimize the number of involuntary context switches. But Oracle processes still run in TS class." (private letter Skeen)

Oracle Internals and Solaris System Services

Mark Johnson of Oracle and Jim Skeen provide the following expert insight and information. The system global area is defined as "One or more shared segments visible to all Oracle processes that are used to store precompiled SQL and PL/SQL (library cache), database buffers (buffer cache), and for interprocess communication" (Johnson). As far as process control - "Oracle does use semaphores, but latches are the usual synchronizing mechanism, as mutexes implemented as spin locks" (Johnson). On the subject of locks "Oracle maintains database transaction integrity through use of database locks of various sorts--shared read, exclusive read, exclusive write, etc. These are implemented through database locks, not using Unix file locks. Thus, the scope of a database lock can be limited to a single row in the database. Or, the database may choose to lock a database page (which may be quite a bit smaller than a Unix page). Or, the database may choose to lock an entire database table (which may be composed of multiple database files, which in turn may or may not map into Unix files)." (private letter Skeen).

Oracle uses heavyweight processes that are in the shared memory portion of the process address space. The DBWR (data buffer writer) process uses aio threads known as light weight processes (LWP). An LWP is a kernel-supported user thread that is based on kernel threads.
They are independently scheduled and share the address space of the process. Vahalia's book has a nice discussion on LWPs. (Jaffee)

Kernel Asynchronous I/O and Intimate Shared Memory are two key technologies used by Oracle on the Solaris platform. Asynchronous I/O is needed because a single blocking thread in a multi-threaded application causes all threads to wait until the thread wakes up. What needs to happen is for the thread to issue an asynchronous I/O request and then pass control to another thread in the process. Also heavy I/O is not efficient when done synchronously because of the large number of context switches that must occur every time a thread is blocked. (Hyuck Yoo)

Asynchronous I/O under Solaris is implemented two ways - under Solaris 2.3 it is using the library and under Solaris 2.4 and beyond it is in the file system layer of the kernel. The library approach uses kernel-level threads where each I/O request is handled by a newly created kernel-level thread that acts synchronously (i.e. issuing read and write calls). The library lives outside of the kernel and the kernel threads that perform the I/O are separate from the calling process. The kernel approach is much more sophisticated and efficient. The basic concept is to not maintain the queue in user space but to put the request directly into the device driver queue. The biowait function is bypassed (which is the device driver equivalent to a blocking function) and the thread transfers control rather than sleep in the kernel. The kernel has buffers with slots called AIO that maintain a listing of all I/O requests. (Hyuck Yoo)

Solaris has provided the ISM feature since 2.2. The main feature of ISM is in addition to sharing the "memory" pages (like the normal shared memory), it also shares the page table entries for those pages (therefore, it's "intimate"). Another side feature, which is more important for this discussion, is that ISM also locks down the shared memory segment in real physical RAM.
Since the main purpose of ISM is for the DBMS products' buffer cache usage, this makes sense. (Jaffee)

Sharing page table entries solves the problem of page table stealing which is expensive because all the pages mapped in the stolen page table have to be flushed before being given to another process. This avoids the condition where the whole system may thrash as processes steal page tables from each other. (H. Yoo)

The design team created a new segment in the process address space called segshm so that they could create one set of page tables for a shared memory segment and share the page tables among the processes that attach that same shared memory. In addition to saving page table allocation, sharing page tables has other advantages such as having a higher cache hit rate on memory map lookups because the tables are in a buffer cache rather than in memory. It also avoids the amount of overhead done by the hardware address translation layer since it no longer needs to go through page tables for every process to monitor whether a page has been modified. These are both huge savings and speed up the virtual memory paging algorithm within Solaris. (H. Yoo)

IPC

The Oracle RDBMS is a complex program that uses multiple cooperating processes that must communicate with each other and share resources. The kernel provides a mechanism in user space called inter process communication or IPC. The processes operate in a shared memory segment such that if one process modifies data it will be immediately visible to the other processes. Data transfer and event notifications occur between the various Oracle processes in the Oracle SGA. Semaphores are used for Oracle's own locking and synchronization scheme. Asynchronous events such as errors are reported to the processes using signals. The default action for most signals from the kernel is to terminate the process, however the process may specify an alternate response by providing a signal handler function.
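The last point, replacing a signal's default action with a handler, can be sketched in a few lines of Python. This is only an illustration using the standard signal module on a Unix system; the handler name on_usr1 and the received list are made up for the example and have nothing to do with how Oracle itself handles signals.

```python
import os
import signal

received = []  # signal numbers seen by the handler (illustrative only)


def on_usr1(signum, frame):
    """Record the signal instead of letting the default action end the process."""
    received.append(signum)


# Install the handler, then send SIGUSR1 to this very process.
# Without the handler, the default action for SIGUSR1 is to terminate.
signal.signal(signal.SIGUSR1, on_usr1)
os.kill(os.getpid(), signal.SIGUSR1)

print("handled:", [signal.Signals(s).name for s in received])
```

The process survives the signal because the handler replaces the default action. SIGKILL cannot be caught this way, which is why it ends a process with no exceptions.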
(Tip - Before installing the kernel jumbo patch read the readme file to see if there are any known signal problems with Oracle). (Vahalia - p150)

The relevant IPC system calls Oracle makes are shmget, semget, shmat, shmdt, shmctl, and semctl. The ipc information is stored in the kernel with the ipc_perm structure. shmget(key, size, flag) creates a portion of shared memory (which will be the size of the Oracle SGA) and shmat(shmid, shmaddr, shmflag) attaches the region to a virtual memory address of the process. (shmsys is how Oracle sets up the intimate shared memory segment). The structure of a shared memory segment includes access permission, segment size, the PID of the process performing the last operation, and the memory map segment descriptor pointer as well as other fields.

(Tip - sgabeg in the ksms.s file is a virtual address, not a physical address (0-0xffffffff = 2 GB). Choose small beginning addresses for large SGAs. Also watch out for 28 bit Sparc chips. They have a smaller virtual address space. Hal Stern notes "They're really not 28 bit chips, but instead the system architecture only passes 28 bits of virtual address space on to the memory bus." [private letter])

Once attached the region may be accessed like any other memory location without requiring system calls to read or write data to it. Hence shared memory is the fastest mechanism for processes to share data.

(Tip - don't be confused by the SZ field in ps -elf. It is in 4 KB pages and represents shared memory in the case of Oracle. For example Oracle may have 60 server processes in a shared memory segment, all approximately 25000 4 KB pages. A common misconception is to think that Oracle needs 60 X 4KB X 25000 = 6 GB of virtual memory. Those 60 processes are mainly using the shared memory region in the process address space).

(Tip - shared memory pages are backed by swap space, not by a file.
The absolute minimum swap must be at least the size of the SGA.)

A process detaches the shared memory with shmdt(addr) and destroys the shared memory region completely with the IPC_RMID command of the shmctl system call.

(Tip - the important commands are ipcs -b; look at field SEGSZ for shared memory size in use; sysdef -i and sysdef -i -n /dev/ksyms for IPC and resource table definitions; kill -9 <process id> to terminate (no core file) a hung process or kill -6 <process id> to abort (core file) a hung Oracle process. modload -p sys/shmsys at the command line or forceload: sys/shmsys in the system file may be needed if ipcs -b doesn't work correctly.) This is because the kernel is dynamic, meaning that file systems, drivers, and modules are loaded into memory when they are used, and the memory is returned if the module is no longer needed. (Vahalia - p155-158, p162-164)

Semaphores are counters that are used by Oracle to monitor and control the availability of shared memory segments. Typically the process initializes the semaphore with semget, assigns ownership of the semaphore with semctl, and then updates the semaphore with semop. A process has to block until the semaphore operation has reached zero. A semaphore structure contains the following information - semaphore value, the PID of the process that last performed successfully, the number of processes waiting for the semaphore to increase, and the number of processes waiting for the semaphore to reach zero. (Tip - ipc_perm and sem in ipc.h, sem.h) (System Services Guide - p68-77).

Shared Memory and Semaphore Tunables in Solaris 2 relevant to Oracle. (Tip - semmnu = semmns = semmsl X semmni). There is no harm in setting the numbers too high since the Oracle instance will only allocate semaphores and shared memory as needed. The values are definitions not declarations.

Name    Default  Min      Max            Reference                             Suggested
____    _______  ___      ___            _________                             _________
shmmax  1048576  1048576  Available RAM  Maximum shm segment size in bytes     50% of RAM
shmmin  1        1        -              Minimum shm segment size in bytes     1
shmmni  100      100      -              Number of shm ids to pre-allocate     100
shmseg  6        6        -              Maximum number shm seg per process    32
semmni  10       10       65535          Number of semaphore identifiers       64
semmns  60       -        -              Number of semaphores in system        1600
semmnu  30       -        -              Number of undo structures in sys      250
semmsl  25       -        -              Maximum number of semaphores per ID   25 (fixed)

Solaris Tuning According to the Experts

Every month in SunWorld Online, the performance experts at Sun write articles on tuning. In addition to the well known book, "Sun Performance and Tuning", Adrian Cockcroft with the help of Rich Pettit has put together a series of scripts called se2.5 (www.sun.com/960201/columns/adrian/se2.5.html). Hal Stern, another well known Sun tuning guru, has written an O'Reilly press book on "Managing NFS & NIS" and he too writes articles that can be downloaded off of the web. Fellow SunService Engineers Chris Drake and Kimberley Woods wrote "Panic - System Core dump Analysis" which contains detailed information on the Solaris kernel and common techniques used to analyze core files. Brian Wong the hardware expert has written a book called "Configuration and Capacity Planning of Large Sun Servers". Most of the tuning information for large Sun Servers running Oracle can be found in these sources. Since many customers often call SunService for further explanations, it is appropriate to highlight some common questions and answer them as the experts would.

Question 1 - Where is all my Memory?

Probably the most common performance question of all is "Why does vmstat report only xxxx amount of free memory available?" To use an example, type vmstat 5 and suppose the system shows freemem of 80708 and available swap of 330000. Now start the application and observe that the freemem goes down to 8824 and swap goes to 300000. Now stop the application and observe that all of the available swap returns to 330000 but the freemem returns only to 12260.
Where then is all of the RAM? Do we have a memory leak? The answer is probably no because as Cockcroft notes "(the app) starts up more quickly than it did the first time, and with less disk activity. The application code and its data files are still in memory, even though they are not active. The memory they occupy is not "free." If you restart the same application it finds the pages that are already in memory. The pages are attached to the inode cache entries for the files. If you start a different application, and there is insufficient free memory, the kernel will scan for pages that have not been touched for a long time, and "free" them. Once you quit the first application, the memory it occupies is not being touched, so it will be freed quickly for use by other applications." (Cockcroft 1) Leaving parts of the app in memory even after termination is efficient because "Attaching to a page in memory is around 1,000 times faster than reading it in from disk." (Cockcroft 1)

So how can one know if he has a memory leak in his application? The answer is there will be a shortage of swap space after the program runs a while and the SZ field in ps -elf for that app will grow over time.

Question 2 - My Oracle Server is slow. Can you help me tune the kernel?

The answer depends on the version of the operating system and the level of the patches. Early versions of the os had performance bugs and incompatible hardware that were the cause of slow performance. The latest version of the os is self-tuning for high performance and will work quite successfully on systems ranging from a huge SparcCenter 2000 to small desktops. As Cockcroft says "In normal use there is no need to tune the Solaris 2 kernel, since it dynamically adapts itself to the given hardware configuration and application workload." (Cockcroft 2) However for really large Oracle servers some tuning may be needed if using early versions of Solaris 2.3, 2.4
and 2.5 without a kernel patch that automatically adjusts the paging algorithm. Solaris 2.5.1 is self tuning for large memory systems.

Paul Faramelli of the kernel TSE group has put together the following list of tunables for Solaris. Recommendations for large Oracle servers (RAM > 1 GB) are listed. (Tip - Use crash to display kernel tunables. As root type crash. At the greater than prompt, type "od -d maxuser" or "od -d lotsfree". The od stands for octal dump, and the -d stands for decimal. By the way every Solaris tunable [even undocumented ones] can be displayed by typing nm /kernel/unix). Note these recommendations are only necessary for early versions of Solaris. Some recommendations are provided by Steve O'Neil of SunService. (Caution - there is no right answer)

Parameter        Description                                            Recommended
---------        -----------                                            -----------
dump_cnt         Size of the dump
autoup           Used in struct var for dynamic configuration of the    300
                 age that a delayed-write buffer must be, in seconds,
                 before bdflush will write it out (default = 60)
bufhwm           Used in struct var for v_bufhwm; it's the high water   8000
                 mark for buffer cache memory usage, in Kbytes
                 (2% of memory).
maxusers         Maximum number of users (in 2.3 and 2.4 the default
                 is number of Megabytes in memory)
max_nprocs       Maximum number of processes (10 + 16 * maxusers)
maxuprc          The maximum number of user processes
                 (max_nprocs - 5)
rstchown         POSIX_CHOWN_RESTRICTED is enabled (default = 1)
ngroups_max      Maximum number of supplementary groups per user
                 (default 32).
rlim_fd_cur      Maximum number of open file descriptors per process
                 system wide (default = 64, max = 1024)
ncallout         Number of callout buffers
                 (default = 16 + max_nprocs).
                 (No longer exists in Solaris 2.2 and later releases)
nautopush        Number of entries in the autopush free list            1024
sadcnt           Number allowed of concurrent opens of both             2048
                 /dev/sad/user and /dev/sad/admin (default 16).
npty             Number of 4.X pseudo-ttys configured (default 48)      1024
pt_cnt           Number of 5.X pseudo-ttys configured (default 48)      1024
physmem          Sets the number of pages usable in physical memory.
                 Only use this for testing, it reduces the size of
                 memory.
minfree          Memory threshold which determines when to start        100
                 swapping processes; when free memory falls to this
                 level swapping begins
fastscan         The number of pages scanned per second when free
                 memory is zero; the scan rate increases as free
                 memory falls from lotsfree to zero, reaching
                 fastscan (default: 2.4 physmem / 4 with 64Mb being
                 max, 2.3 physmem / 2).
slowscan         The number of pages scanned per second when free
                 memory is equal to lotsfree, also see fastscan
                 (defaults: 2.4 is fixed at 100, 2.3 fastscan / 10).
handspreadpages  The distance between the front hand and backhand in
                 the clock algorithm. The larger the number the
                 longer an idle page can stay in memory
                 (default: 2.4 physmem / 4, 2.3 physmem / 2).
maxpgio          The maximum number of page-out I/O operations per      120
                 second. This acts as a throttle for the page daemon
                 to prevent page thrashing
                 ((DISKRPM * 2) / 3 = 40). This parameter must be
                 set higher if using two swap partitions.
t_gpgslo         2.1 through 2.3; used to set the threshold on when
                 to swap out processes (default 25 pages).
ufs_ninode       Maximum number of inodes                               34906
                 (max_nprocs + 16 + maxusers + 64)
ndquot           Number of disk quota structures
                 (default = (maxusers * NMOUNT / 4) + max_nprocs)
ncsize           Number of dnlc entries (default = max_nprocs + 16 +    34906
                 maxusers + 64); dnlc is the directory-name lookup
                 cache
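Several of the defaults in the table above are plain arithmetic on maxusers, so a small calculation shows how they scale. The following Python sketch only restates the formulas quoted in the table. The function name derived_defaults is made up for the example, the value NMOUNT = 40 is an assumption (the table does not give it), and the integer division in ndquot is likewise assumed.

```python
NMOUNT = 40  # assumed value of the kernel constant; not stated in the table


def derived_defaults(maxusers):
    """Return the table's default formulas evaluated for a given maxusers."""
    max_nprocs = 10 + 16 * maxusers
    return {
        "max_nprocs": max_nprocs,
        "maxuprc": max_nprocs - 5,
        "ncallout": 16 + max_nprocs,
        "ufs_ninode": max_nprocs + 16 + maxusers + 64,
        "ncsize": max_nprocs + 16 + maxusers + 64,
        "ndquot": (maxusers * NMOUNT) // 4 + max_nprocs,
    }


if __name__ == "__main__":
    # Example: a machine where maxusers has sized itself to 64.
    for name, value in derived_defaults(64).items():
        print(f"{name:12s} {value}")
```

Because everything here hangs off maxusers, changing that one value shifts the process, inode and dnlc table sizes together.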

Cockcroft on maxusers "I never set maxusers. It sizes itself based on the amount of RAM in the system. In some cases on configurations with gigabytes of RAM it needs to be reduced to avoid problems with lack of kernel address space. The kernel uses up a lot of space keeping track of all the RAM in a system. Several other kernel table sizes and limits are derived from maxusers." (Cockcroft 2)

Cockcroft on ncsize "The directory name lookup cache (DNLC) is sized to a default value based on maxusers. A large cache size (ncsize) significantly helps NFS servers that have a lot of clients. On other systems the default is adequate." (Cockcroft 2)

Question 3: How much swap is needed for a large Oracle database?

Many people are under the impression that very little swap is needed for Oracle because the architecture uses temporary tablespaces for sorting and the SGA is fixed in memory. Well the truth is large databases require a lot of swap. The shared memory segment is backed by swap so the allocated swap MUST be at least as large as the shared memory segments. In addition when the database uses intimate shared memory this is also backed by swap. All of the Oracle processes must be partially backed by swap. Steve Schuettinger, the Oracle applications specialist at Sun, recommends at least 2 GB of swap for benchmark testing on large servers. Obviously since RAM plus swap equals virtual memory, once swap is gone, the program will halt and no new apps can be started until other programs have stopped. As Adrian Cockcroft says "The important thing to realize about swap space is that it is the combined total size of every program running and dormant on the system that matters. When a system runs out of swap space it can be very difficult to recover. Sometimes you find that there is insufficient swap space left to login as root or run the commands needed to kill the errant process that is consuming all the swap space." (Cockcroft 3)

In theory Solaris 2
changes the rules by adding the RAM and the disk space so if the system has enough RAM for the workload, "it can run with no swap disk. In practice common database applications that are sized to run in a few gigabytes of RAM will actually need many gigabytes of disk allocated as swap space." (Cockcroft 3) In the same article Cockcroft says "The consequences of running out of swap space affect a larger number of users on a big server, so it is wise to allocate a lot more than you normally need to cope with any usage peaks. To start with, add twice as much disk as you have RAM." (Cockcroft 3)

(Tip - It is not worth making a striped metadevice to swap on - that would just add overhead and slow it down. There is also a limit of 2 gigabytes on the size of each swap partition, so striping disks together tends to make them too big.)

/usr/ucb/ps alx, fields SZ or SIZE, /usr/proc/bin/pmap.

/usr/ucb/ps alx
 F  UID PID PPID CP PRI NI  SZ RSS WCHAN    S TT     TIME COMMAND
 8 2595  33   30  0  48 20 988 360 modlinka S pts/4  0:00 -bin/csh

There is confusion between what ps reports. "The /bin/ps prints a field labelled SZ, but this is the resident set size in RAM -- printed as RSS by the /usr/ucb/ps. You need to use the SZ or SIZE field reported by /usr/ucb/ps alx in units of kilobytes to determine the amount of swap space used by the process." (Cockcroft 3)

Oracle's Mark Johnson adds the following "I had thought the standard Oracle rule of thumb was 2 to 4 times physical memory (can be a bit less on very large memory systems). Smaller memory systems may want to use higher ratios of SGA size to physical memory size and higher swap space ratios. (I ended up using ratios of 1:1 and 1:4
for a very small Solaris for Intel system with surprisingly good results.)"

Hal Stern says "So why do you need swap space if your SGA << phys mem? The short answer is that the "phys mem" in that calculation is the non-locked-down physical memory, and when you allocate an oracle SGA, you allocate intimate shared memory (ISM) that is taken out of the physical memory pool (ie, it gets locked down). so on a 1 Gbyte machine, you may think you're ok with a 256M SGA, leaving 700M+ for processes. BUT: the 256M SGA gets taken out of the available memory pool, so your maximum VM is only 700M+, and you could probably use the swap space....as the SGA/memory ratio goes up, this is even more true." (private letter from Stern)

Question 4 - Will a faster cpu help performance?

The question is not easy to answer. As Hal Stern noted "Noticing that you're using 20 percent of the CPU doesn't mean anything until you know the kind of work that's using the cycles. If you're CPU-bound, then you have headroom to increase the workload by a factor of four or five. An I/O-bound job, however, that uses 20 percent of the CPU might be improved by adding disk spindles. As you increase the disk count and I/O load, to ease the bottleneck, you'll use more CPU to deal with the I/O setup, system calls, and interrupts from the additional work. You run the risk of morphing a disk problem into a CPU shortage. How do you know when relaxing one constraint pops another one into the foreground? Define the right relationships -- CPU time used per disk I/O tells you how much system time you eat up as you add disk load -- and measure with your tailored yardstick." (Stern 1)

Preventing Kernel Memory Starvation

When Oracle is working very hard and the operating system is Solaris 2.3 or early Solaris 2.4, it is possible to have kernel memory allocation faults that can eventually lead to kernel memory starvation. A new memory allocator algorithm has been developed and integrated into Solaris 2.5.1
(the old allocator had paging thresholds that were too low, which caused kernel memory allocation failures on very large systems). The allocator has been back ported to rev 40 of the Solaris 2.4 jumbo patch and to a future rev of the 2.5 jumbo patch. No fix has yet been developed for Solaris 2.3. (Tip - large database users should upgrade to Solaris 2.4 or better).

In the past Oracle customers could manually adjust paging thresholds. The actual value that needed to be set was proportional and depended upon the amount of memory and the number of cpus on the system. Also in some cases decreasing maxusers and bufhwm would mitigate the problem. The total allowable size for the kernel on the ultrasparc servers running 2.5 is now so large that kernel memory allocation problems on very large systems are virtually impossible. See examples below. The crash output displaying kernel memory starvation is taken from a SparcServer 1000 running Solaris 2.3 with 1 GB of ram and 8 cpus.

Kernel memory limits

Solaris 2.4:       Solaris 2.5:
sun4c  33MB        sun4c  33MB
sun4m  6MB         sun4m  100MB
sun4d  39MB        sun4d  25MB
                   sun4u  2525MB

crash
> map kernelmap
FREE: 2042   WANT: 1   SIZE: 2042
SIZE   ADDRESS
TOTAL NUMBER OF SEGMENTS 0   TOTAL SIZE 0
> kmastat
                  total bytes  total bytes
size     # pools     in pools    allocated  # failures
-----------------------------------------------------------------
small       6807      2638880     25677584       98995
big         2652     75276288     73046528           0
outsize        -            -       857264        4535

Crash is a very powerful tool that helps analyze kernel memory allocation failures. In the output, "TOTAL SIZE 0" indicates that no more free kernel memory exists. The FREE field (2042) indicates that there is still plenty of memory in the user portion of the virtual address space. Carl of Sunsoft provides an explanation of kernel map scarcity under Solaris 2.3 and Solaris 2.4.

"In the overwhelming majority of cases on large database servers, we have found that 64MB is overly generous for bufhwm in that it can be cut back by one-half (to 32MB) without too much of an impact on the cache hit ratio. What is usually in short supply on these machines is not the buffer cache but the amount of kernel heap (mapped by kernelmap) that remains for non-buffer cache usage. Limiting buffer cache growth to 32MB frees up an additional 32MB to the heap and has proven successful in avoiding kernelmap scarcity at a number of sites running large database applications.

Kernelmap scarcity (or equivalently kernel heap scarcity as the size of the kernel heap is limited by the size of the address space the kernelmap can map) results in an extreme slowdown of processing in the systems. All of a sudden kernelmap becomes a scarce resource that every thread contends for and to exacerbate the situation the rate of release is slowed by the very same contention to the point that kernelmap turnover grinds down almost to the point of deadlock.

Why 64MB's worth of kernelmap is inadequate for the largest database servers is unknown. The sites on which this has been a problem have been checked for kernelmap leakage and none has been found. There has also been a problem in the past with some kernel data structures being pre allocated from the heap and the size of this pre allocation being inappropriately scaled to physical memory. As it is fairly common now for machines to be equipped with 3GB of physical memory, this was not the right thing to do and did account for some kernelmap depletion headaches. But this particular bug has been fixed. With these two things discounted, the only conclusion is that modern database workloads are driving up peak transient demands for kernelmap to the 100MB level."

(Tip - For large databases running Solaris 2.4
or less set bufhwm to 8000 on 4c, 4m, and 4d or upgrade to Solaris 2.5 which has a large kernel map address space.)

Acknowledgements

I want to thank Sun performance gurus Adrian Cockcroft and Hal Stern for their contributions to this paper. UNIX architect Mark Johnson of Oracle and database expert Jim Skeen of Sunsoft provided comments on Oracle internals. Kernel architect Jeff Bonwick has added explanations and suggestions regarding kernel memory allocation and kernel memory starvation. SunService kernel engineer Paul Faramelli documented the Solaris tuning parameters and SunService Technical Expert Steve O'Neil provided recommendations for tuning large Oracle databases on versions of Solaris that are not self tuning. Finally I want to thank Uresh Vahalia who gave me permission to quote at length from his wonderful book "UNIX Internals - The New Frontiers".

Disclaimer

The author alone is responsible for the contents of this paper. No one at Sun Microsystems, Sunsoft, SunService, or the Oracle Corporation has reviewed or approved the paper for completeness or accuracy in its published format and nothing in the paper can be construed as the official policy of Sun Microsystems or the Oracle Corporation.

References

"UNIX Internals - The New Frontiers" by Uresh Vahalia, Prentice Hall 1996
"How the Solaris Kernel is Optimized for Oracle" by Mike Jaffee 1996
"Shared Page Table: Virtual Memory Enhancement for Data Sharing in UNIX" by H. Yoo
"Comparative analysis of Asynchronous I/O in Multithreaded UNIX" by Hyuck Yoo
"Help! I've lost my memory!" by Adrian Cockcroft, SunWorldOnline 1995 (1)
"What are the tunable kernel parameters for Solaris 2?" by Adrian Cockcroft (2)
"How does swap space work?" by Adrian Cockcroft, SunWorldOnline 1995 (3)
"We suggest creative ways to better your system performance" by Hal Stern
System Service Guide - Solaris 2.4 Manual, SunSoft, 1994
"The Slab Allocator: An Object-Caching Kernel Memory Allocator" by Jeff Bonwick

Ju
julienlim@rocketmail.com

Because e-mail can be altered electronically, the integrity of this communication cannot be guaranteed.