Red Hat Summit 2009 | John Shakshober / Larry Woodman

Performance Analysis & Tuning of Red Hat Enterprise Linux
Larry Woodman / John Shakshober
Kernel / Consulting Engineer
Red Hat
September 2009

Agenda
Section 1: System Overview
Section 2: Analyzing System Performance
Section 3: Tuning Red Hat Enterprise Linux
Section 4: Performance Analysis and Tuning Examples
References

Processors Supported/Tested
RHEL4 limits (maximum CPUs)
x86: 32
x86_64: 8, 64 (LargeSMP)
ia64: 64, 512 (SGI)

RHEL5 limits (maximum CPUs)
x86: 32
x86_64: 255
ia64: 64, 1024 (SGI)

Processor types
Uni-Processor
Symmetric Multi-Processor (SMP)
Multi-Core
Simultaneous Multi-Threading (SMT, Hyper-Threading)
Combinations

Processor types & locations


[root@intels3e3601node1]# cat /proc/cpuinfo
processor   : 0    <logical cpu #>
physical id : 0    <socket #>
siblings    : 16   <logical cpus per socket>
core id     : 0    <core # in socket>
cpu cores   : 8    <physical cores per socket>
# cat /sys/devices/system/node/node*/cpulist
node0: 0-3
node1: 4-7
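
The /proc/cpuinfo fields above can be combined to count sockets, cores, and logical CPUs. The following is a minimal sketch, not part of the original slides; the function name summarize_topology and the SAMPLE text are illustrative assumptions, and a real run would read /proc/cpuinfo instead.

```python
def summarize_topology(cpuinfo_text):
    """Count logical CPUs, sockets and cores from /proc/cpuinfo-style text."""
    logical = 0
    sockets = set()
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "processor":
            logical += 1            # one "processor" stanza per logical CPU
        elif key == "physical id":
            physical_id = value     # socket number for the current stanza
            sockets.add(value)
        elif key == "core id":
            cores.add((physical_id, value))  # core ids repeat across sockets
    return {"logical_cpus": logical, "sockets": len(sockets), "cores": len(cores)}


# Hypothetical sample: one socket, two cores, two threads per core.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 0
processor : 2
physical id : 0
core id : 1
processor : 3
physical id : 0
core id : 1
"""

print(summarize_topology(SAMPLE))
```

When logical_cpus is larger than cores, the extra logical CPUs are hardware threads sharing a core.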

NUMA & Multi Core Support


Cpusets (2.6.12)
Enable CPU & Memory assignment to sets of tasks
Allow dynamic job placement on large systems
NUMA-aware slab allocator (2.6.14)
Optimized locality & management of slab creation
Swap migration (2.6.16)
Relocates physical pages between nodes in a NUMA system while the process is running, which improves performance
Huge page support for NUMA (2.6.16)
Netfilter ip_tables: NUMA-aware allocation (2.6.16)
Multi-core
Scheduler improvements for shared-cache multi-core systems (2.6.17)
Scheduler power saving policy
Power consumption improvements through optimized task spreading

Red Hat Performance NDA Required 2009

Typical NUMA System Layout


[Figure: four NUMA nodes (Node 0 through Node 3), each with two cores (C0, C1) and local memory. Two panels show where process memory lands for a process on N1C0: Interleaved (Non-NUMA), with pages striped across N0, N1, N2, N3 in turn; and Non-Interleaved (NUMA).]

NUMA Support
RHEL4 NUMA Support
NUMA aware memory allocation policy
NUMA aware memory reclamation
Multi-core support

RHEL5 NUMA Support


RHEL4 NUMA support (taskset, numactl)
NUMA aware scheduling
NUMA-aware slab allocator
NUMA-aware hugepages

Memory Management

Physical Memory(RAM) Management


Virtual Address Space Maps
Kernel Wired Memory
Reclaimable User Memory
Page Reclaim Dynamics


Physical Memory(RAM) Management


Physical Memory Layout
NUMA versus Non-NUMA(UMA)
NUMA Nodes
Zones
mem_map array
Page lists
Free
Active
Inactive


Physical Memory Supported/Tested


RHEL4
x86: 4GB, 16GB, 64GB
x86_64: 512GB
ia64: 1TB

RHEL5
x86: 4GB, 16GB
x86_64: 512GB/1TB
ia64: 2TB


Memory Zones
32-bit
0 to 16MB: DMA Zone
16MB to 896MB (or 3968MB): Normal Zone
896MB (or 3968MB) to end of RAM, up to 64GB (PAE): Highmem Zone

64-bit
0 to 16MB: DMA Zone
16MB to 4GB: DMA32 Zone
4GB to end of RAM: Normal Zone

Memory Zone Utilization(x86)

DMA
24-bit I/O

Normal
Kernel Static
Kernel Dynamic
  slabcache
  bounce buffers
  driver allocations
User Overflow

Highmem (x86)
User
  Anonymous
  Pagecache
  Pagetables

Memory Zone Utilization(x86_64)


DMA

24bit I/O

DMA32

32bit I/O
Normal overflow

Normal

Kernel Static
Kernel Dynamic
slabcache
bounce buffers
driver allocations
User
Anonymous
Pagecache
Pagetables


Per-Zone Resources

RAM
mem_map
Page lists: free, active and inactive
Page allocation and reclamation
Page reclamation watermarks


mem_map

Kernel maintains a page struct for each 4KB page of RAM (16KB on IA64, 64KB on PPC64/RHEL5)
mem_map is the global array of page structs
Page struct size (x86, x86_64):
32-bit = 32 bytes
64-bit = 56 bytes
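
The page struct sizes above determine how much RAM mem_map itself consumes. A small worked calculation, as a sketch only: the helper name mem_map_bytes is hypothetical, and it assumes exactly one page struct per page with no other per-page overhead.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def mem_map_bytes(ram_bytes, page_size=4096, page_struct_size=56):
    """Bytes of mem_map needed to describe ram_bytes of RAM."""
    pages = ram_bytes // page_size
    return pages * page_struct_size


# 64-bit x86_64, 64GB RAM, 4KB pages, 56-byte page structs
print(mem_map_bytes(64 * GIB) / MIB)                        # 896.0 MiB
# 32-bit x86, 16GB RAM, 4KB pages, 32-byte page structs
print(mem_map_bytes(16 * GIB, page_struct_size=32) / MIB)   # 128.0 MiB
```

On 32-bit systems this array lives in low memory, so the cost grows with installed RAM even though the Normal zone does not.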


Per-zone page lists


Active List: most recently referenced
Anonymous: stack, heap, bss
Pagecache: filesystem data/metadata
Inactive List: least recently referenced
Dirty: modified
Writeback in progress
Clean: ready to free
Free
Coalesced buddy allocator


Per zone Free list/buddy allocator lists


Kernel maintains per-zone free list
Buddy allocator coalesces free pages into larger physically contiguous pieces
DMA
1*4kB 4*8kB 6*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB = 11588kB

Normal
217*4kB 207*8kB 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 3468kB

HighMem
847*4kB 409*8kB 17*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 7924kB

Memory allocation failures
Freelist exhaustion
Freelist fragmentation
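
Each free-list line is a sum of count*size terms, so the total and the largest contiguous block still available can be recomputed from it. A minimal sketch, assuming the count*sizekB term format shown above; the function names are illustrative.

```python
import re

TERM = re.compile(r"(\d+)\*(\d+)kB")


def buddy_total_kb(line):
    """Sum every count*size term on a buddy free-list line."""
    return sum(int(count) * int(size) for count, size in TERM.findall(line))


def largest_free_block_kb(line):
    """Largest block size with at least one free entry (0 if none)."""
    sizes = [int(size) for count, size in TERM.findall(line) if int(count) > 0]
    return max(sizes, default=0)


DMA = "1*4kB 4*8kB 6*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 2*4096kB"
NORMAL = "217*4kB 207*8kB 1*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB"

print(buddy_total_kb(DMA), largest_free_block_kb(DMA))
print(buddy_total_kb(NORMAL), largest_free_block_kb(NORMAL))
```

In the Normal zone sample, free memory remains but nothing larger than 512kB is contiguous, which is the fragmentation case: a larger request fails even though the free list is not exhausted.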


Per NUMA-Node Resources

Memory zones(DMA & Normal zones)


CPUs
IO/DMA capacity
Interrupt processing
Page reclamation kernel thread(kswapd#)


NUMA Nodes and Zones


64-bit
Node 0: DMA Zone (0 to 16MB), DMA32 Zone (16MB to 4GB), Normal Zone (above 4GB)
Node 1: Normal Zone (up to end of RAM)


Virtual Address Space Maps

32-bit
3G/1G address space
4G/4G address space(RHEL4 only)
64-bit
X86_64
IA64


Linux 32-bit Address Spaces (SMP)

3G/1G Kernel (SMP)
Virtual: 0GB to 3GB user, 3GB to 4GB kernel
RAM: DMA, Normal, HighMem

RHEL4 32-bit Address Space (Hugemem)

4G/4G Kernel (Hugemem)
Virtual: separate address spaces for User(s) and Kernel, 0GB to 3968MB
RAM: DMA, Normal (up to 3968MB), HighMem

Linux 64-bit Address Space


x86_64
VIRT: User from 0 up to 128TB (2^47), Kernel above
RAM

IA64
VIRT
RAM


Memory Pressure
32-bit
DMA, Normal: Kernel Allocations
Highmem: User Allocations

64-bit
DMA, Normal: Kernel and User Allocations

Kernel Memory Pressure


Static: boot-time (DMA and Normal zones)
Kernel text, data, BSS
Bootmem allocator, tables and hashes (mem_map)
Dynamic
Slabcache (Normal zone)
Kernel data structs
Inode cache, dentry cache and buffer header dynamics
Pagetables (Highmem/Normal zone)
HugeTLBfs (Highmem/Normal zone)


User Memory Pressure


Anonymous/pagecache split
Pagecache allocations fill the pagecache
Page faults fill anonymous memory

PageCache/Anonymous memory split


Pagecache memory is global and grows as filesystem data is accessed, until memory is exhausted.
Pagecache is freed when:
Underlying files are deleted.
The filesystem is unmounted.
kswapd reclaims pagecache pages because memory is exhausted.
/proc/sys/vm/drop_caches is written.
Anonymous memory is private and grows on user demand:
Allocation followed by pagefault.
Swapin.
Anonymous memory is freed when:
The process unmaps the anonymous region or exits.
kswapd reclaims anonymous pages (swapout) because memory is exhausted.


PageCache/Anonymous memory split


Balance between pagecache and anonymous memory.
Dynamic.
Controlled via:
/proc/sys/vm/pagecache.
/proc/sys/vm/swappiness.
Swap files/partitions.


32-bit Memory Reclamation


Kernel Allocations (DMA, Normal)
Kernel Reclamation (kswapd)
slabcache reaping
inode cache pruning
bufferhead freeing
dentry cache pruning

User Allocations (Highmem)
User Reclamation (kswapd/pdflush)
page aging
pagecache shrinking
swapping

64-bit Memory Reclamation

RAM

Kernel and User Allocations


Kernel and User Reclamation


Anonymous/pagecache reclaiming
Pagecache allocations fill the pagecache; it shrinks through:
kswapd (bdflush/pdflush, kupdated) page reclaim
deletion of a file
unmount of the filesystem

Page faults fill anonymous memory; it shrinks through:
kswapd page reclaim (swapout)
unmap
exit

Per Node/Zone Paging Dynamics


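[Figure: per-zone page state diagram. User allocations take pages from FREE onto ACTIVE. Page aging moves pages from ACTIVE to INACTIVE; referenced pages are reactivated back to ACTIVE. INACTIVE pages go from dirty to clean through swapout and pdflush (RHEL4/5), and reclaiming moves clean pages to FREE. User deletions also return pages to FREE.]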


Memory reclaim Watermarks


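Free list, from all of RAM down to 0:
Above Pages High: do nothing; kswapd sleeps above High
Between Pages High and Pages Low: kswapd reclaims memory
Pages Low: kswapd wakes up at Low
Between Pages Low and Pages Min: kswapd reclaims memory
Pages Min: all memory allocators reclaim at Min
Between Pages Min and 0: user processes and kswapd reclaim memory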
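
The watermark rules form a small hysteresis loop: kswapd wakes at Low and keeps reclaiming until free memory is back above High. The sketch below models that for one zone. It is an illustration under stated assumptions, not kernel code: the function name, the return labels, and the exact boundary comparisons are my own choices.

```python
def reclaim_state(free_kb, min_kb, low_kb, high_kb, kswapd_awake):
    """Return (who_reclaims, kswapd_awake_next) for one zone."""
    if free_kb <= min_kb:
        # At or below Min: allocating processes reclaim alongside kswapd.
        return "allocators and kswapd", True
    if free_kb <= low_kb:
        # At or below Low: kswapd is woken and reclaims in the background.
        return "kswapd", True
    if free_kb < high_kb and kswapd_awake:
        # Between Low and High: kswapd keeps going only if already awake.
        return "kswapd", True
    # Above High, or between Low and High with kswapd asleep: do nothing.
    return "nobody", False


# Watermarks in kB, chosen for illustration.
MIN, LOW, HIGH = 13480, 16848, 20220
for free in (13155772, 18000, 15000, 10000):
    print(free, reclaim_state(free, MIN, LOW, HIGH, kswapd_awake=False))
```

The kswapd_awake argument is what makes the band between Low and High behave differently on the way down than on the way up.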


File System & Disk I/O

[Figure: Read()/Write() moves data by memory copy between a buffer in user space and a pagecache page in the kernel; disk I/O moves data between the pagecache page and the device by DMA.]

Buffered file system write


[Figure: write() copies data by memory copy from the user buffer into a pagecache page in the kernel, which becomes dirty.]

Dirty thresholds, as a percentage of pagecache RAM:
0% to 10% dirty: do nothing
10% dirty: wake up pdflush
10% to 40% dirty: pdflush writes dirty buffers in the background
40% dirty: processes start synchronous writes
40% to 100% dirty: pdflush and write()'ing processes write dirty buffers
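
The 10% and 40% thresholds can be read as a simple decision function. A minimal sketch, assuming the two percentages from the slide as defaults; the function and parameter names are illustrative, and on a live system the corresponding values would come from the dirty_background_ratio and dirty_ratio tunables in /proc/sys/vm.

```python
def writeback_action(dirty_kb, pagecache_kb, background_pct=10, sync_pct=40):
    """Classify writeback behaviour from the share of pagecache that is dirty."""
    if pagecache_kb <= 0:
        return "none"
    dirty_pct = 100.0 * dirty_kb / pagecache_kb
    if dirty_pct >= sync_pct:
        # Writers are throttled: they write dirty buffers themselves.
        return "synchronous"
    if dirty_pct >= background_pct:
        # pdflush is woken and writes in the background.
        return "background"
    return "none"


for dirty in (50, 150, 450):
    print(dirty, writeback_action(dirty, 1000))
```

Lowering the background threshold starts writeback earlier and smooths I/O; lowering the synchronous threshold limits how much dirty data a burst of writes can queue before writers stall.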


Buffered file system read


[Figure: read() copies data by memory copy from a pagecache page in the kernel into the user buffer (marked dirty in the figure).]

Direct I/O file system read()/write()

[Figure: Read()/Write() transfers data by DMA directly between the buffer in user space and the device; the pagecache is not used.]

Section 2: Analyzing System Performance
Performance Monitoring Tools
What to run under certain loads
Analyzing System Performance
What to look for


Performance Monitoring Tools

Standard Unix OS tools
Monitoring: cpu, memory, process, disk
oprofile

Kernel Tools
/proc, info (cpu, mem, slab), dmesg, AltSysrq

Networking

Profiling
nmi_watchdog=1, profile=2

Tracing
strace, ltrace
dprobe, kprobe

3rd party profiling/capacity monitoring
Perfmon, Caliper, vtune
SARcheck, KDE, BEA Patrol, HP Openview

Red Hat Top Tools


CPU Tools
1 top
2 vmstat
3 ps aux
4 mpstat -P ALL
5 sar -u
6 iostat
7 oprofile
8 gnome-system-monitor
9 KDE-monitor
10 /proc

Memory Tools
1 top
2 vmstat -s
3 ps aur
4 ipcs
5 sar -r -B -W
6 free
7 oprofile
8 gnome-system-monitor
9 KDE-monitor
10 /proc

Process Tools
1 top
2 ps -o pmem
3 gprof
4 strace, ltrace
5 sar

Disk Tools
1 iostat -x
2 vmstat -D
3 sar -DEV #
4 nfsstat
5 NEED MORE!

Monitoring Tools
mpstat: per-CPU stats, hard/soft interrupt usage
vmstat: VM page info, context switches, total interrupts/s, CPU
netstat: per-NIC status, errors, statistics at driver level
lspci: list the devices on PCI, in-depth driver flags
oprofile: system-level profiling, kernel/driver code
modinfo: list information about drivers, version, options
sar: collect, report, save system activity information
Many others available: iptraf, wireshark, etc.
Sample use for some of these is embedded in the talk


top (press h: help, 1: show cpus, m: memory, t: threads, >: column sort)

top - 09:01:04 up 8 days, 15:22, 2 users, load average: 1.71, 0.39, 0.12
Tasks: 114 total, 1 running, 113 sleeping, 0 stopped, 0 zombie
Cpu0 : 5.3% us, 2.3% sy, 0.0% ni, 0.0% id, 92.0% wa, 0.0% hi, 0.3% si
Cpu1 : 0.3% us, 0.3% sy, 0.0% ni, 89.7% id, 9.7% wa, 0.0% hi, 0.0% si
Mem: 2053860k total, 2036840k used, 17020k free, 99556k buffers
Swap: 2031608k total, 160k used, 2031448k free, 417720k cached
  PID USER    PR  NI  VIRT   RES   SHR  S  %CPU  %MEM  TIME+    COMMAND
27830 oracle  16   0  1315m  1.2g  1.2g D   1.3  60.9  0:00.09  oracle
27802 oracle  16   0  1315m  1.2g  1.2g D   1.0  61.0  0:00.10  oracle
27811 oracle  16   0  1315m  1.2g  1.2g D   1.0  60.8  0:00.08  oracle
27827 oracle  16   0  1315m  1.2g  1.2g D   1.0  61.0  0:00.11  oracle
27805 oracle  17   0  1315m  1.2g  1.2g D   0.7  61.0  0:00.10  oracle
27828 oracle  15   0  27584  6648  4620 S   0.3   0.3  0:00.17  tpcc.exe
    1 root    16   0   4744   580   480 S   0.0   0.0  0:00.50  init
    2 root    RT   0      0     0     0 S   0.0   0.0  0:00.11  migration/0
    3 root    34  19      0     0     0 S   0.0   0.0  0:00.00  ksoftirqd/0


vmstat(paging vs swapping)
vmstat 10
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
200548352420052423457600546315251303096
020169784020052439314400057850482108539941221463
300784420052457841090059330589463243144307321842
vmstat 10
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
200548352420052423457600546315251303096
02016623402005242345760057850482108539941221463
3023567873842005242345761875423745193589463243144307321842


Vmstat - IOzone(8GB file with 6GB RAM)


#! deplete memory until pdflush turns on
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
200448352420052423457600546315251303096
020169784020052429314400057850482108539941221463
3001537884200524384109200193589463243144307321842
02052812020052462281720047888810177133921322246
01046140200524671373600179110719144718251303535
22050972200524670574400232119698131619710253144
#! now transition from write to reads
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
14051040200524670554400213351912658390265618
1103506420052467127240040118911136720210354223
01068264234372664702000767445420484032072073
01034468234372667801600773913416202834091872
01047320234372669035600810507717832916072073
10038756234372669834400761364420273705191972
01031472234372670653200767253316012807081973


iostat -x of same IOzone EXT3 file system


iostat metrics
Rates per second:
r|w rqm/s: requests merged/s
r|w sec/s: 512-byte sectors/s
r|w KB/s: kilobytes/s
r|w /s: operations/s
Sizes and response time:
avgrq-sz: average request size
avgqu-sz: average queue size
await: average wait time (ms)
svctm: average service time (ms)

Linux 2.4.21-27.0.2.ELsmp (node1)
avg-cpu:  %user  %nice  %sys  %iowait  %idle
           0.40   0.00  2.63     0.91  96.06
Device:  rrqm/s  wrqm/s  r/s  w/s  rsec/s  wsec/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdi  16164.60  0.00  523.40  0.00  133504.00  0.00  66752.00  0.00  255.07  1.00  1.91  1.88  98.40
sdi  17110.10  0.00  553.90  0.00  141312.00  0.00  70656.00  0.00  255.12  0.99  1.80  1.78  98.40
sdi  16153.50  0.00  522.50  0.00  133408.00  0.00  66704.00  0.00  255.33  0.98  1.88  1.86  97.00
sdi  17561.90  0.00  568.10  0.00  145040.00  0.00  72520.00  0.00  255.31  1.01  1.78  1.76  100.00
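
Several iostat columns are derived from the others, which gives a quick sanity check on any sample. The sketch below recomputes the average request size and the kB/s rate from the first sdi sample (523.40 reads/s, 133504.00 sectors read/s); the function names are illustrative and it assumes 512-byte sectors as stated in the metric list.

```python
def avg_request_sectors(rsec_s, wsec_s, r_s, w_s):
    """Average request size in sectors: sectors transferred per completed I/O."""
    ops = r_s + w_s
    return (rsec_s + wsec_s) / ops if ops else 0.0


def sectors_to_kb(sectors, sector_bytes=512):
    """Convert a sector count (or rate) to kilobytes."""
    return sectors * sector_bytes / 1024


print(round(avg_request_sectors(133504.00, 0.00, 523.40, 0.00), 2))  # avgrq-sz column
print(sectors_to_kb(133504.00))                                      # rkB/s column
```

An average request of about 255 sectors is close to 128KB per I/O, so this workload is issuing large sequential reads rather than small random ones.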


SAR
[root@localhost redhat]# sar -u 3 3
Linux 2.4.21-20.EL (localhost.localdomain) 05/16/2005

10:32:28 PM  CPU  %user  %nice  %system  %idle
10:32:31 PM  all   0.00   0.00     0.00  100.00
10:32:34 PM  all   1.33   0.00     0.33   98.33
10:32:37 PM  all   1.34   0.00     0.00   98.66
Average:     all   0.89   0.00     0.11   99.00
[root] sar -n DEV
Linux 2.4.21-20.EL (localhost.localdomain) 03/16/2005

01:10:01 PM  IFACE  rxpck/s  txpck/s  rxbyt/s  txbyt/s  rxcmp/s  txcmp/s  rxmcst/s
01:20:00 PM  lo        3.49     3.49   306.16   306.16     0.00     0.00      0.00
01:20:00 PM  eth0      3.89     3.53  2395.34   484.70     0.00     0.00      0.00
01:20:00 PM  eth1      0.00     0.00     0.00     0.00     0.00     0.00      0.00


Networking tools
Tuning tools
ethtool: view and change Ethernet card settings
sysctl: view and set /proc/sys settings
ifconfig: view and set ethX variables
setpci: view and set PCI bus params for device
netperf: can run a bunch of different network tests
/proc: OS info, place for changing device tunables


ethtool
Works mostly at the HW level
ethtool -S provides HW level stats
Counters since boot time, create scripts to calculate diffs
ethtool -c - Interrupt coalescing
ethtool -g - provides ring buffer information
ethtool -k - provides hw assist information
ethtool -i - provides the driver information
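
Because ethtool -S counters accumulate since boot, a rate needs two snapshots and a subtraction. A minimal sketch of such a diff script, assuming the usual "name: value" line layout; the counter names and numbers in the sample snapshots are hypothetical, and capturing the text (for example by running ethtool -S on an interface twice) is left out.

```python
def parse_counters(text):
    """Parse 'name: value' lines into a dict, skipping non-numeric lines."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            counters[name.strip()] = int(value)
    return counters


def counter_rates(before, after, interval_s):
    """Per-second change for every counter present in both snapshots."""
    return {name: (after[name] - before[name]) / interval_s
            for name in after if name in before}


# Hypothetical snapshots taken 10 seconds apart.
SNAP1 = "NIC statistics:\n     rx_packets: 1000\n     tx_packets: 500\n     rx_errors: 0\n"
SNAP2 = "NIC statistics:\n     rx_packets: 6000\n     tx_packets: 1500\n     rx_errors: 0\n"

print(counter_rates(parse_counters(SNAP1), parse_counters(SNAP2), 10))
```

Counters that stay at zero between snapshots can be filtered out to leave only the statistics that are actually moving.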


CPU Utilization Raw vs. Tuned IRQ, NAPI


Not Tuned
CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle  intr/s
all  0.23  0.00  8.01  0.02  0.00  10.78  80.96  21034.49
0  0.00  0.00  0.00  0.01  0.00  52.16  47.83  20158.58
1  0.00  0.00  0.00  0.02  0.00  0.00  100.00  125.14
2  0.00  0.00  0.00  0.08  0.00  0.00  99.93  125.14
3  0.00  0.00  0.00  0.03  0.00  0.00  99.99  125.13
4  1.79  0.00  64.11  0.00  0.00  34.11  0.01  125.14
5  0.01  0.00  0.00  0.02  0.00  0.00  99.99  125.14
6  0.00  0.00  0.00  0.00  0.00  0.00  100.01  125.14
7  0.00  0.00  0.00  0.02  0.00  0.00  99.99  125.14

With Tuning
CPU  %user  %nice  %system  %iowait  %irq  %soft  %idle  intr/s
all  0.26  0.00  10.44  0.00  0.00  12.50  76.79  1118.61
0  0.00  0.00  0.00  0.00  0.00  0.00  100.00  1.12
1  0.01  0.00  0.00  0.00  0.00  0.00  99.99  0.00
2  0.00  0.00  0.00  0.00  0.00  0.00  100.00  0.00
3  0.00  0.00  0.00  0.00  0.00  0.00  100.00  0.00
4  2.08  0.00  83.54  0.00  0.00  0.00  14.38  0.00
5  0.00  0.00  0.01  0.00  0.00  100.00  0.00  1.95
6  0.00  0.00  0.00  0.00  0.00  0.02  99.98  0.68
7  0.00  0.00  0.00  0.00  0.03  0.00  99.98  1114.86

free/numastat memory allocation


[root@localhost redhat]# free -l
                    total    used    free  shared  buffers  cached
Mem:               511368  342336  169032       0    29712  167408
Low:               511368  342336  169032       0        0       0
High:                   0       0       0       0        0       0
-/+ buffers/cache:         145216  366152
Swap:             1043240       0 1043240

numastat (on 2-cpu x86_64 based system)
                  node1     node0
numa_hit        9803332  10905630
numa_miss       2049018   1609361
numa_foreign    1609361   2049018
interleave_hit    58689     54749
local_node      9770927  10880901
other_node      2081423   1634090
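
One way to read the numastat counters is as a per-node share of allocations that landed on the node they were intended for: numa_hit against numa_hit plus numa_miss. A minimal sketch using the sample figures; the function name is illustrative, and this ratio is only one of several possible summaries.

```python
def intended_node_percent(numa_hit, numa_miss):
    """Percent of pages allocated on a node that were intended for that node."""
    total = numa_hit + numa_miss
    return 100.0 * numa_hit / total if total else 0.0


SAMPLE = {
    "node1": {"numa_hit": 9803332, "numa_miss": 2049018},
    "node0": {"numa_hit": 10905630, "numa_miss": 1609361},
}

for node, counts in SAMPLE.items():
    print(node, round(intended_node_percent(**counts), 1))
```

A falling percentage over time suggests that the preferred node is running out of free memory and allocations are spilling to the other node.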

ps

[root@localhost root]# ps aux
[root@localhost root]# ps aux | more
USER  PID  %CPU  %MEM   VSZ  RSS  TTY  STAT  START  TIME  COMMAND
root    1   0.1   0.1  1528  516  ?    S     23:18  0:04  init
root    2   0.0   0.0     0    0  ?    SW    23:18  0:00  [keventd]
root    3   0.0   0.0     0    0  ?    SW    23:18  0:00  [kapmd]
root    4   0.0   0.0     0    0  ?    SWN   23:18  0:00  [ksoftirqd/0]
root    7   0.0   0.0     0    0  ?    SW    23:18  0:00  [bdflush]
root    5   0.0   0.0     0    0  ?    SW    23:18  0:00  [kswapd]
root    6   0.0   0.0     0    0  ?    SW    23:18  0:00  [kscand]


pstree
init-+-/usr/bin/sealer
     |-acpid
     |-atd
     |-auditd-+-python
     |        `-{auditd}
     |-automount---6*[{automount}]
     |-avahi-daemon---avahi-daemon
     |-bonobo-activati---{bonobo-activati}
     |-bt-applet
     |-clock-applet
     |-crond
     |-cupsd---cups-polld
     |-3*[dbus-daemon---{dbus-daemon}]
     |-2*[dbus-launch]
     |-dhclient


The /proc filesystem


/proc
meminfo
slabinfo
cpuinfo
pid<#>/maps
vmstat(RHEL4 & RHEL5)
zoneinfo(RHEL5)
sysrq-trigger


/proc/meminfo
RHEL4> cat /proc/meminfo
MemTotal: 32749568 kB
MemFree: 31313344 kB
Buffers: 29992 kB
Cached: 1250584 kB
SwapCached: 0 kB
Active: 235284 kB
Inactive: 1124168 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 32749568 kB
LowFree: 31313344 kB
SwapTotal: 4095992 kB
SwapFree: 4095992 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 1124080 kB
Slab: 38460 kB
CommitLimit: 20470776 kB
Committed_AS: 1158556 kB
PageTables: 5096 kB
VmallocTotal: 536870911 kB
VmallocUsed: 2984 kB
VmallocChunk: 536867627 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

RHEL5> cat /proc/meminfo
MemTotal: 1025220 kB
MemFree: 11048 kB
Buffers: 141944 kB
Cached: 342664 kB
SwapCached: 4 kB
Active: 715304 kB
Inactive: 164780 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 1025220 kB
LowFree: 11048 kB
SwapTotal: 2031608 kB
SwapFree: 2031472 kB
Dirty: 84 kB
Writeback: 0 kB
AnonPages: 395572 kB
Mapped: 82860 kB
Slab: 92296 kB
PageTables: 23884 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 2544216 kB
Committed_AS: 804656 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 263472 kB
VmallocChunk: 34359474711 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
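
The meminfo fields are easy to pull into a script. The sketch below parses "Name: value kB" lines and adds free, buffer, and cache memory as a rough estimate of memory that is free or cheaply reclaimable; the function names are illustrative, the sample uses four of the RHEL5 values above, and treating all of Buffers and Cached as reclaimable is a simplifying assumption.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB or count)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def free_plus_cache_kb(info):
    """MemFree + Buffers + Cached, a rough upper bound on reclaimable memory."""
    return info["MemFree"] + info["Buffers"] + info["Cached"]


SAMPLE = """\
MemTotal: 1025220 kB
MemFree: 11048 kB
Buffers: 141944 kB
Cached: 342664 kB
"""

info = parse_meminfo(SAMPLE)
print(free_plus_cache_kb(info), "kB of", info["MemTotal"], "kB")
```

In this sample MemFree alone is about 1% of RAM, while free plus cache is close to half of it, which is why a low MemFree by itself is not a sign of memory shortage.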

/proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
nfsd4_delegations 0 0 656 6 1 : tunables 54 27 8 : slabdata 0 0 0
nfsd4_stateids 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
nfsd4_files 0 0 72 53 1 : tunables 120 60 8 : slabdata 0 0 0
nfsd4_stateowners 0 0 424 9 1 : tunables 54 27 8 : slabdata 0 0 0
nfs_direct_cache 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
nfs_write_data 36 36 832 9 2 : tunables 54 27 8 : slabdata 4 4 0
nfs_read_data 32 35 768 5 1 : tunables 54 27 8 : slabdata 7 7 0
nfs_inode_cache 1383 1389 1040 3 1 : tunables 24 12 8 : slabdata 463 463 0
nfs_page 0 0 128 30 1 : tunables 120 60 8 : slabdata 0 0 0
fscache_cookie_jar 3 53 72 53 1 : tunables 120 60 8 : slabdata 1 1 0
ip_conntrack_expect 0 0 136 28 1 : tunables 120 60 8 : slabdata 0 0 0
ip_conntrack 75 130 304 13 1 : tunables 54 27 8 : slabdata 10 10 0
bridge_fdb_cache 0 0 64 59 1 : tunables 120 60 8 : slabdata 0 0 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12 8 : slabdata 4 4 0
rpc_tasks 30 30 384 10 1 : tunables 54 27 8 : slabdata 3 3 0


/proc/cpuinfo
[lwoodman]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU 3060 @ 2.40GHz
stepping        : 6
cpu MHz         : 2394.070
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips        : 4791.41
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:


32-bit /proc/<pid>/maps
[root@dhcp8336 proc]# cat 5808/maps
0022e000-0023b000 r-xp 00000000 03:03 4137068 /lib/tls/libpthread-0.60.so
0023b000-0023c000 rw-p 0000c000 03:03 4137068 /lib/tls/libpthread-0.60.so
0023c000-0023e000 rw-p 00000000 00:00 0
0037f000-00391000 r-xp 00000000 03:03 523285 /lib/libnsl-2.3.2.so
00391000-00392000 rw-p 00011000 03:03 523285 /lib/libnsl-2.3.2.so
00392000-00394000 rw-p 00000000 00:00 0
00c45000-00c5a000 r-xp 00000000 03:03 523268 /lib/ld-2.3.2.so
00c5a000-00c5b000 rw-p 00015000 03:03 523268 /lib/ld-2.3.2.so
00e5c000-00f8e000 r-xp 00000000 03:03 4137064 /lib/tls/libc-2.3.2.so
00f8e000-00f91000 rw-p 00131000 03:03 4137064 /lib/tls/libc-2.3.2.so
00f91000-00f94000 rw-p 00000000 00:00 0
08048000-0804f000 r-xp 00000000 03:03 1046791 /sbin/ypbind
0804f000-08050000 rw-p 00007000 03:03 1046791 /sbin/ypbind
09794000-097b5000 rw-p 00000000 00:00 0
b5fdd000-b5fde000 ---p 00000000 00:00 0
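
Each maps line follows the layout "start-end perms offset dev inode path", so region sizes can be computed directly. A minimal sketch of a parser, assuming that whitespace-separated layout; the function name and the returned field names are illustrative.

```python
def parse_maps_line(line):
    """Split one /proc/<pid>/maps line into its fields and compute the size."""
    fields = line.split(None, 5)
    start, end = (int(addr, 16) for addr in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "size_kb": (end - start) // 1024,
        "perms": fields[1],
        "offset": int(fields[2], 16),
        "dev": fields[3],
        "inode": int(fields[4]),
        # Anonymous regions have no path field.
        "path": fields[5].strip() if len(fields) > 5 else "",
    }


text_seg = parse_maps_line("08048000-0804f000 r-xp 00000000 03:03 1046791 /sbin/ypbind")
heap = parse_maps_line("09794000-097b5000 rw-p 00000000 00:00 0")
print(text_seg["path"], text_seg["size_kb"], "kB")
print("anonymous", heap["size_kb"], "kB")
```

Summing size_kb over regions with an empty path gives a quick view of how much of the address space is anonymous memory rather than file-backed.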


64-bit /proc/<pid>/maps
# cat /proc/2345/maps
00400000-0100b000 r-xp 00000000 fd:00 1933328 /usr/sybase/ASE-12_5/bin/dataserver.esd3
0110b000-01433000 rw-p 00c0b000 fd:00 1933328 /usr/sybase/ASE-12_5/bin/dataserver.esd3
01433000-014eb000 rwxp 01433000 00:00 0
40000000-40001000 ---p 40000000 00:00 0
40001000-40a01000 rwxp 40001000 00:00 0
2a95f73000-2a96073000 ---p 0012b000 fd:00 819273 /lib64/tls/libc-2.3.4.so
2a96073000-2a96075000 r--p 0012b000 fd:00 819273 /lib64/tls/libc-2.3.4.so
2a96075000-2a96078000 rw-p 0012d000 fd:00 819273 /lib64/tls/libc-2.3.4.so
2a96078000-2a9607e000 rw-p 2a96078000 00:00 0
2a9607e000-2a98c3e000 rw-s 00000000 00:06 360450 /SYSV0100401e (deleted)
2a98c3e000-2a98c47000 rw-p 2a98c3e000 00:00 0
2a98c47000-2a98c51000 r-xp 00000000 fd:00 819227 /lib64/libnss_files-2.3.4.so
2a98c51000-2a98d51000 ---p 0000a000 fd:00 819227 /lib64/libnss_files-2.3.4.so
2a98d51000-2a98d53000 rw-p 0000a000 fd:00 819227 /lib64/libnss_files-2.3.4.so
2a98d53000-2a98d57000 r-xp 00000000 fd:00 819225 /lib64/libnss_dns-2.3.4.so
2a98d57000-2a98e56000 ---p 00004000 fd:00 819225 /lib64/libnss_dns-2.3.4.so
2a98e56000-2a98e58000 rw-p 00003000 fd:00 819225 /lib64/libnss_dns-2.3.4.so
2a98e58000-2a98e69000 r-xp 00000000 fd:00 819237 /lib64/libresolv-2.3.4.so
2a98e69000-2a98f69000 ---p 00011000 fd:00 819237 /lib64/libresolv-2.3.4.so
2a98f69000-2a98f6b000 rw-p 00011000 fd:00 819237 /lib64/libresolv-2.3.4.so
2a98f6b000-2a98f6d000 rw-p 2a98f6b000 00:00 0
35c7e00000-35c7e08000 r-xp 00000000 fd:00 819469 /lib64/libpam.so.0.77
35c7e08000-35c7f08000 ---p 00008000 fd:00 819469 /lib64/libpam.so.0.77
35c7f08000-35c7f09000 rw-p 00008000 fd:00 819469 /lib64/libpam.so.0.77
35c8000000-35c8011000 r-xp 00000000 fd:00 819468 /lib64/libaudit.so.0.0.0
35c8011000-35c8110000 ---p 00011000 fd:00 819468 /lib64/libaudit.so.0.0.0
35c8110000-35c8118000 rw-p 00010000 fd:00 819468 /lib64/libaudit.so.0.0.0
35c9000000-35c900b000 r-xp 00000000 fd:00 819457 /lib64/libgcc_s-3.4.4-20050721.so.1
35c900b000-35c910a000 ---p 0000b000 fd:00 819457 /lib64/libgcc_s-3.4.4-20050721.so.1
35c910a000-35c910b000 rw-p 0000a000 fd:00 819457 /lib64/libgcc_s-3.4.4-20050721.so.1
7fbfff1000-7fc0000000 rwxp 7fbfff1000 00:00 0
ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0


/proc/vmstat
cat /proc/vmstat
nr_anon_pages 98893
nr_mapped 20715
nr_file_pages 120855
nr_slab 23060
nr_page_table_pages 5971
nr_dirty 21
nr_writeback 0
nr_unstable 0
nr_bounce 0
numa_hit 996729666
numa_miss 0
numa_foreign 0
numa_interleave 87657
numa_local 996729666
numa_other 0
pgpgin 2577307
pgpgout 106131928
pswpin 0
pswpout 34
pgalloc_dma 198908
pgalloc_dma32 997707549
pgalloc_normal 0
pgalloc_high 0
pgfree 997909734
pgactivate 1313196
pgdeactivate 470908
pgfault 2971972147
pgmajfault 8047
pgrefill_dma 18338
pgrefill_dma32 1353451
pgrefill_normal 0
pgrefill_high 0
pgsteal_dma 0
pgsteal_dma32 0
pgsteal_normal 0
pgsteal_high 0
pgscan_kswapd_dma 7235
pgscan_kswapd_dma32 417984
pgscan_kswapd_normal 0
pgscan_kswapd_high 0
pgscan_direct_dma 12
pgscan_direct_dma32 1984
pgscan_direct_normal 0
pgscan_direct_high 0
pginodesteal 166
slabs_scanned 1072512
kswapd_steal 410973
kswapd_inodesteal 61305
pageoutrun 7752
allocstall 29
pgrotated 73
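
The pgscan counters separate pages scanned by kswapd in the background from pages scanned by allocating processes in direct reclaim. The sketch below computes the direct-reclaim share from the sample values above; the function names are illustrative, and the choice of this ratio as a health indicator is my own framing rather than something the slides state.

```python
def parse_vmstat(text):
    """Parse 'name value' lines from /proc/vmstat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    return stats


def direct_scan_percent(stats):
    """Share of scanned pages that were scanned by direct reclaim."""
    kswapd = sum(v for k, v in stats.items() if k.startswith("pgscan_kswapd_"))
    direct = sum(v for k, v in stats.items() if k.startswith("pgscan_direct_"))
    total = kswapd + direct
    return 100.0 * direct / total if total else 0.0


SAMPLE = """\
pgscan_kswapd_dma 7235
pgscan_kswapd_dma32 417984
pgscan_kswapd_normal 0
pgscan_kswapd_high 0
pgscan_direct_dma 12
pgscan_direct_dma32 1984
pgscan_direct_normal 0
pgscan_direct_high 0
allocstall 29
"""

print(round(direct_scan_percent(parse_vmstat(SAMPLE)), 2))
```

A small share means kswapd is keeping up; a growing share, together with a rising allocstall count, means processes are being stalled to reclaim memory themselves.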

Alt Sysrq M
Free pages: 15809760kB (0kB HighMem)
Active:51550 inactive:54515 dirty:44 writeback:0 unstable:0 free:3952440 slab:8727 mapped-file:5064 mapped-anon:20127 pagetables:1627
Node 0 DMA free:10864kB min:8kB low:8kB high:12kB active:0kB inactive:0kB present:10460kB pages_scanned:0 all_unreclaimable? no
Node 0 DMA32 free:2643124kB min:2760kB low:3448kB high:4140kB active:0kB inactive:0kB present:2808992kB pages_scanned:0 all_unreclaimable? no
Node 0 Normal free:13155772kB min:13480kB low:16848kB high:20220kB active:206200kB inactive:218060kB present:13703680kB pages_scanned:0 all_unreclaimable? no
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Node 0 DMA: 4*4kB 2*8kB 3*16kB 1*32kB 2*64kB 1*128kB 1*256kB 0*512kB 2*1024kB 0*2048kB 2*4096kB = 10864kB
Node 0 DMA32: 1*4kB 0*8kB 1*16kB 1*32kB 0*64kB 1*128kB 0*256kB 2*512kB 2*1024kB 3*2048kB 643*4096kB = 2643124kB
Node 0 Normal: 453*4kB 161*8kB 44*16kB 15*32kB 4*64kB 4*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 3210*4096kB = 13155772kB
Node 0 HighMem: empty
85955 pagecache pages
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap = 2031608kB
Total swap = 2031608kB
Free swap: 2031608kB
4521984 pages of RAM
446612 reserved pages
21971 pages shared
0 pages swap cached

Alt Sysrq M - NUMA


Free pages: 15630596kB (0kB HighMem)
Active:77517 inactive:67928 dirty:1000 writeback:0 unstable:0 free:3907649 slab:10391 mapped-file:8975 mapped-anon:38003 pagetables:4731
Node 0 DMA free:10864kB min:8kB low:8kB high:12kB active:0kB inactive:0kB present:10460kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 2743 8045 8045
Node 0 DMA32 free:2643480kB min:2760kB low:3448kB high:4140kB active:0kB inactive:0kB present:2808992kB pages_scanned:0 all_unreclaimable? no
Node 0 Normal free:4917364kB min:5340kB low:6672kB high:8008kB active:204836kB inactive:197340kB present:5429760kB pages_scanned:0 all_unreclaimable? no
Node 0 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Node 1 DMA free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Node 1 DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Node 1 Normal free:8058888kB min:8140kB low:10172kB high:12208kB active:105232kB inactive:74372kB present:8273920kB pages_scanned:0 all_unreclaimable? no
Node 1 HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
Node 0 DMA: 6*4kB 5*8kB 3*16kB 2*32kB 3*64kB 2*128kB 0*256kB 0*512kB 2*1024kB 0*2048kB 2*4096kB = 10864kB
Node 0 DMA32: 2*4kB 2*8kB 0*16kB 2*32kB 1*64kB 1*128kB 1*256kB 2*512kB 2*1024kB 3*2048kB 643*4096kB = 2643480kB
Node 0 Normal: 91*4kB 47*8kB 27*16kB 5*32kB 5*64kB 0*128kB 0*256kB 1*512kB 2*1024kB 1*2048kB 1199*4096kB = 4917364kB
Node 1 Normal: 78*4kB 48*8kB 477*16kB 326*32kB 261*64kB 105*128kB 55*256kB 33*512kB 20*1024kB 0*2048kB 1943*4096kB = 8058888kB
107476 pagecache pages
4521984 pages of RAM


Alt Sysrq T
gdmgreeter S ffff810009036800 0 7511 7483 7489 (NOTLB)
ffff81044ae05b38 0000000000000082 0000000000000080 0000000000000000
0000000000000000 000000000000000a ffff810432ed97a0 ffff81010f387080
0000002a3a0d4398 0000000000003b57 ffff810432ed9988 0000000600000000
Call Trace:
[<ffffffff8006380f>] schedule_timeout+0x1e/0xad
[<ffffffff80049b33>] add_wait_queue+0x24/0x34
[<ffffffff8002db7e>] pipe_poll+0x2d/0x90
[<ffffffff8002f764>] do_sys_poll+0x277/0x360
[<ffffffff8001e99c>] __pollwait+0x0/0xe2
[<ffffffff8008be44>] default_wake_function+0x0/0xe
[<ffffffff8008be44>] default_wake_function+0x0/0xe
[<ffffffff8008be44>] default_wake_function+0x0/0xe
[<ffffffff80012f1a>] sock_def_readable+0x34/0x5f
[<ffffffff8004a81a>] unix_stream_sendmsg+0x281/0x346
[<ffffffff80037c3a>] do_sock_write+0xc6/0x102
[<ffffffff801277da>] avc_has_perm+0x43/0x55
[<ffffffff80276a6e>] unix_ioctl+0xc7/0xd0
[<ffffffff8021f48f>] sock_ioctl+0x1c1/0x1e5
[<ffffffff800420a7>] do_ioctl+0x21/0x6b
[<ffffffff800302a0>] vfs_ioctl+0x457/0x4b9
[<ffffffff800b6193>] audit_syscall_entry+0x180/0x1b3
[<ffffffff8004c4f6>] sys_poll+0x2d/0x34
[<ffffffff8005d28d>] tracesys+0xd5/0xe0


Alt Sysrq W and P


SysRq : Show CPUs
CPU2:
ffff81010f30bf48 0000000000000000 ffff81010f305e20 ffffffff801ae69e
0000000000000000 0000000000000200 ffffffff803ea2a0 ffffffff801ae6cd
ffffffff801ae69e ffffffff80022d85 ffffffff80197393 00000000000000ff
Call Trace:
<IRQ> [<ffffffff801ae69e>] showacpu+0x0/0x3b
[<ffffffff801ae6cd>] showacpu+0x2f/0x3b
[<ffffffff801ae69e>] showacpu+0x0/0x3b
[<ffffffff80022d85>] smp_call_function_interrupt+0x57/0x75
[<ffffffff80197393>] acpi_processor_idle+0x0/0x463
[<ffffffff8005dc22>] call_function_interrupt+0x66/0x6c
<EOI> [<ffffffff80197324>] acpi_safe_halt+0x25/0x36
[<ffffffff8019751a>] acpi_processor_idle+0x187/0x463
[<ffffffff80197395>] acpi_processor_idle+0x2/0x463
[<ffffffff80197393>] acpi_processor_idle+0x0/0x463
[<ffffffff80197393>] acpi_processor_idle+0x0/0x463
[<ffffffff80049399>] cpu_idle+0x95/0xb8
[<ffffffff80076e12>] start_secondary+0x45a/0x469


Profiling Tools: OProfile


Open source project
http://oprofile.sourceforge.net
Upstream; Red Hat contributes
Originally modeled after DEC Continuous Profiling Infrastructure (DCPI)
System-wide profiler (both kernel and user code)
Sample-based profiler with SMP machine support
Performance monitoring hardware support
Relatively low overhead, typically <10%
Designed to run for long times
Included in base Red Hat Enterprise Linux product

Events to measure with OProfile:
Initially time-based samples most useful:
PPro/PII/PIII/AMD: CPU_CLK_UNHALTED
P4: GLOBAL_POWER_EVENTS
IA64: CPU_CYCLES
TIMER_INT (fall-back profiling mechanism) default
Processor specific performance monitoring hardware can provide additional kinds of sampling
Many events to choose from
Branch mispredictions
Cache misses, TLB misses
Pipeline stalls/serializing instructions


oprofile built in to RHEL4 & 5 (smp)

opcontrol: on/off data
--start: start collection
--stop: stop collection
--dump: output to disk
--event=:name:count

Example:
# opcontrol --start
# /bin/time test1 &
# sleep 60
# opcontrol --stop
# opcontrol --dump

opreport: analyze profile
-r: reverse order sort
-t [percentage]: threshold to view
-f /path/filename
-d: details

opannotate
-s /path/source
-a /path/assembly


oprofile opcontrol and opreport cpu_cycles


# CPU: Core 2, speed 2666.72 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
CPU_CLK_UNHALT...|
  samples|      %|

397435971 84.6702 vmlinux
 19703064  4.1976 zeus.web
 16914317  3.6034 e1000
 12208514  2.6009 ld-2.5.so
 11711746  2.4951 libc-2.5.so
  5164664  1.1003 sim.cgi
  2333427  0.4971 oprofiled
  1295161  0.2759 oprofile
  1099731  0.2343 zeus.cgi
   968623  0.2064 ext3
   270163  0.0576 jbd


Profiling Tools: SystemTap


Technology: Kprobes:
In current 2.6 kernels
Upstream 2.6.12, backported to RHEL4 kernel
Kernel instrumentation without recompile/reboot
Uses software int and trap handler for instrumentation
Debug information:
Provides map between executable and source code
Generated as part of RPM builds
Available at: ftp://ftp.redhat.com
Safety: Instrumentation scripting language:
No dynamic memory allocation or assembly/C code
Types and type conversions limited
Restrict access through pointers
Script compiler checks:
Infinite loops and recursion
Invalid variable access


Profiling Tools: SystemTap


Red Hat, Intel, IBM & Hitachi collaboration
Linux answer to Solaris Dtrace
  Dynamic instrumentation
Tool to take a deep look into a running system:
  Assists in identifying causes of performance problems
  Simplifies building instrumentation
Current snapshots available from: http://sources.redhat.com/systemtap
  Source for presentations/papers
Kernel space tracing today, user space tracing under development
Technology preview status until 5.1
[Diagram: probe script -> parse -> elaborate (using probe-set library) -> translate to C, compile * -> load module, start probe (probe kernel object) -> extract output, unload -> probe output]
* Solaris Dtrace is interpretive


SystemTap: Kernel debugging


Several tracepoints were added to RHEL5 kernel
trace_mm_filemap_fault(area->vm_mm, address, page);
trace_mm_anon_userfree(mm, addr, page);
trace_mm_filemap_userunmap(mm, addr, page);
trace_mm_filemap_cow(mm, address, new_page);
trace_mm_anon_cow(mm, address, new_page);
trace_mm_anon_pgin(mm, address, page);
trace_mm_anon_fault(mm, address, page);
trace_mm_page_free(page);
trace_mm_page_allocation(page, zone->free_pages);
trace_mm_pdflush_bgwriteout(_min_pages);
trace_mm_pdflush_kupdate(nr_to_write);
trace_mm_anon_unmap(page, ret == SWAP_SUCCESS);
trace_mm_filemap_unmap(page, ret == SWAP_SUCCESS);
trace_mm_pagereclaim_pgout(page, PageAnon(page));
trace_mm_pagereclaim_free(page, PageAnon(page));
trace_mm_pagereclaim_shrinkinactive_i2a(page);
trace_mm_pagereclaim_shrinkinactive_i2i(page);
trace_mm_pagereclaim_shrinkinactive(nr_reclaimed);
trace_mm_pagereclaim_shrinkactive_a2a(page);
trace_mm_pagereclaim_shrinkactive_a2i(page);
trace_mm_pagereclaim_shrinkactive(pgscanned);
trace_mm_pagereclaim_shrinkzone(nr_reclaimed);
trace_mm_directreclaim_reclaimall(priority);
trace_mm_kswapd_runs(sc.nr_reclaimed);


SystemTap: Kernel debugging


Several custom scripts enable/use tracepoints
(/usr/local/share/doc/systemtap/examples)
#! /usr/local/bin/stap
global traced_pid
global reclaims, command   // per-pid arrays filled in by the probe below

// Returns 1 when the current task should be logged: either no target
// pid was given, or the current task is the target pid.
function log_event:long ()
{
    return (!traced_pid || traced_pid == (task_pid(task_current())))
}

probe kernel.trace("mm_pagereclaim_shrinkinactive") {
    if (!log_event()) next
    reclaims[pid()]++
    command[pid()] = execname()
}

// MM kernel tracepoints prolog and epilog routines
probe begin {
    printf("Starting mm tracepoints\n");
    traced_pid = target();
    if (traced_pid) {
        printf("mode Specific Pid, traced pid: %d\n", traced_pid);
    } else {
        printf("mode - All Pids\n");
    }
    printf("\n");
}

probe end {
    printf("Terminating mm tracepoints\n");
    printf("Command Pid Direct Activate Deactivate Reclaims Freed\n");
    printf("-------------- -------- ---------- -------- -----\n");
    foreach (pid in reclaims-)


SystemTap: Kernel debugging


Command Pid Direct Activate Deactivate Reclaims Freed

kswapd05440150376791943715157430730
kswapd15450180678882434712117341408
memory254359975697573083604621115837
mixer_applet2768764180101333981
Xorg749151906283920382
gnometerminal71612103869512320
gnometerminal77015261422457172
cupsd7100192704128


SystemTap: Kernel debugging


Command Pid Alloc Free A_fault A_ufree A_pgin A_cow A_unmap

memory25685284278440644082834840398981614048185
kswapd1545300753257000049884
kswapd054462025241000017568
mixer_applet27687302282700101241
sshd25051227000600
kjournald86320728300002149
Xorg74911698980000310
gnomepowerman76531520001800
avahidaemon7252150128000480160
irqbalance67251263641313180190
bash250531220001300
hald7264890008300
gconfd271638252600680116


Red Hat MRG Tuna


TUNA command line
Usage: tuna [OPTIONS]
-h, --help                      Give this help list
-g, --gui                       Start the GUI
-c, --cpus=CPU-LIST             CPU-LIST affected by commands
-C, --affect_children           Operation will affect children threads
-f, --filter                    Display filter the selected entities
-i, --isolate                   Move all threads away from CPU-LIST
-I, --include                   Allow all threads to run on CPU-LIST
-K, --no_kthreads               Operations will not affect kernel threads
-m, --move                      move selected entities to CPU-LIST
-p, --priority=[POLICY]:RTPRIO  set thread scheduler POLICY and RTPRIO
-P, --show_threads              show thread list
-s, --save=FILENAME             save kthreads sched tunables to FILENAME
-S, --sockets=CPU-SOCKET-LIST   CPU-SOCKET-LIST affected by commands
-t, --threads=THREAD-LIST       THREAD-LIST affected by commands
-U, --no_uthreads               Operations will not affect user threads
-W, --what_is                   Provides help about selected entities
Examples
tuna -c 0-3 -i (isolate cpu 0-3), tuna -S 1 -i (isolate socket 1 = cpu 0-3 intelq)
tuna -t PID -C -p fifo:50 -S 1 -m -P (move PID# to socket 1, sched:fifo +50 prior


Section 3: Tuning RHEL


How to tune Linux
Capacity tuning
Fix problems by adding resources
Performance Tuning
Methodology
1) Document config
2) Baseline results
3) While results non-optimal
a) Monitor/Instrument system/workload
b) Apply tuning 1 change at a time
c) Analyze results, exit or loop
4) Document final config


Tuning - setting kernel parameters


/proc
[root@foobar fs]# cat /proc/sys/kernel/sysrq (see 0)
[root@foobar fs]# echo 1 > /proc/sys/kernel/sysrq
[root@foobar fs]# cat /proc/sys/kernel/sysrq (see 1)
Sysctl command
[root@foobar fs]# sysctl kernel.sysrq
kernel.sysrq = 0
[root@foobar fs]# sysctl -w kernel.sysrq=1
kernel.sysrq = 1
[root@foobar fs]# sysctl kernel.sysrq
kernel.sysrq = 1
Edit the /etc/sysctl.conf file
# Kernel sysctl configuration file for Red Hat Linux
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 1
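
The /proc and sysctl views above name the same tunable: a dotted sysctl key corresponds to a path under /proc/sys with the dots turned into slashes. A minimal sketch of that mapping follows; the helper names sysctl_path and sysctl_read are hypothetical (not part of any Red Hat tool), and the simple translation assumes no key component itself contains a dot.

```shell
#!/bin/sh
# Map a dotted sysctl key to its /proc/sys path (dots become slashes).
# Hypothetical helper; assumes no component of the key contains a dot.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Read a tunable through /proc; prints nothing if the entry is not readable.
sysctl_read() {
    p=$(sysctl_path "$1")
    [ -r "$p" ] && cat "$p"
}

sysctl_path kernel.sysrq    # prints /proc/sys/kernel/sysrq
```

Values written with echo or sysctl -w last only until reboot, which is why the slide also lists /etc/sysctl.conf for persistent settings.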


Capacity Tuning

Memory
  /proc/sys/vm/overcommit_memory
  /proc/sys/vm/overcommit_ratio
  /proc/sys/vm/max_map_count
  /proc/sys/vm/nr_hugepages
Kernel
  /proc/sys/kernel/msgmax
  /proc/sys/kernel/msgmnb
  /proc/sys/kernel/msgmni
  /proc/sys/kernel/shmall
  /proc/sys/kernel/shmmax
  /proc/sys/kernel/shmmni
  /proc/sys/kernel/threads-max
Filesystems
  /proc/sys/fs/aio-max-nr
  /proc/sys/fs/file-max
OOM kills


OOM kills lowmem consumption


Free pages: 9003696kB (8990400kB HighMem)
Active:323264 inactive:346882 dirty:327575 writeback:3686 unstable:0 free:2250924 slab:177094 mapped:15855 pagetables:987
DMA free:12640kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:149 all_unreclaimable? yes
protections[]: 0 0 0
Normal free:656kB min:928kB low:1856kB high:2784kB active:6976kB inactive:9976kB present:901120kB pages_scanned:28281 all_unreclaimable? yes
protections[]: 0 0 0
HighMem free:8990400kB min:512kB low:1024kB high:1536kB active:1286080kB inactive:1377552kB present:12451840kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 4*4kB 4*8kB 3*16kB 4*32kB 4*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12640kB
Normal: 0*4kB 2*8kB 0*16kB 0*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 656kB
HighMem: 15994*4kB 17663*8kB 11584*16kB 8561*32kB 8193*64kB 1543*128kB 69*256kB 2101*512kB 1328*1024kB 765*2048kB 875*4096kB = 8990400kB
Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap: 8385912kB
3342336 pages of RAM
2916288 pages of HIGHMEM
224303 reserved pages
666061 pages shared
0 pages swap cached
Out of Memory: Killed process 22248 (httpd).
oom-killer: gfp_mask=0xd0
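
Each per-zone line in the dump lists free blocks by size, and the total after the "=" is the sum of count times block size. The sketch below redoes that arithmetic for the Normal zone, whose free memory (656kB) sits below its min watermark (928kB) while HighMem is nearly all free. The function name buddy_free_kb is hypothetical, and awk is assumed to be available.

```shell
#!/bin/sh
# Sum a buddy-allocator free list such as "0*4kB 2*8kB ... 0*4096kB" into kB.
# Hypothetical helper for illustration only.
buddy_free_kb() {
    printf '%s\n' "$1" | awk '{
        total = 0
        for (i = 1; i <= NF; i++) {
            split($i, part, "[*]")           # part[1] = block count, part[2] = size such as "8kB"
            total += part[1] * (part[2] + 0) # "+ 0" keeps the leading number and drops "kB"
        }
        print total
    }'
}

# Normal zone from the dump above
buddy_free_kb "0*4kB 2*8kB 0*16kB 0*32kB 0*64kB 1*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB"   # prints 656
```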


OOM kills IO system stall


Free pages: 15096kB (1664kB HighMem)
Active:34146 inactive:1995536 dirty:255 writeback:314829 unstable:0 free:3774 slab:39266 mapped:31803 pagetables:820
DMA free:12552kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB pages_scanned:2023 all_unreclaimable? yes
protections[]: 0 0 0
Normal free:880kB min:928kB low:1856kB high:2784kB active:744kB inactive:660296kB present:901120kB pages_scanned:726099 all_unreclaimable? yes
protections[]: 0 0 0
HighMem free:1664kB min:512kB low:1024kB high:1536kB active:135840kB inactive:7321848kB present:7995388kB pages_scanned:0 all_unreclaimable? no
protections[]: 0 0 0
DMA: 2*4kB 4*8kB 2*16kB 4*32kB 3*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 2*4096kB = 12552kB
Normal: 0*4kB 18*8kB 14*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 880kB
HighMem: 6*4kB 9*8kB 66*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1664kB
Swap cache: add 856, delete 599, find 341/403, race 0+0
0 bounce buffer pages
Free swap: 4193264kB
2228223 pages of RAM
1867481 pages of HIGHMEM
150341 reserved pages
343042 pages shared
257 pages swap cached
kernel: Out of Memory: Killed process 3450 (hpsmhd).


Eliminating OOM kills

RHEL4
/proc/sys/vm/oom-kill - OOM kill enable/disable flag (default 1).
RHEL5
/proc/<pid>/oom_adj - per-process OOM adjustment (-17 to +15)
  Set to -17 to disable that process from being OOM killed
  Decrease to decrease OOM kill likelihood.
  Increase to increase OOM kill likelihood.
/proc/<pid>/oom_score - current OOM kill priority.


General Performance Tuning Considerations

Over Committing RAM


Swap device location
Storage device and limits
Kernel selection


Performance Tuning
Kernel Selection
VM tuning
Processor related tuning
NUMA related tuning
Disk & IO tuning
Hugepages
KVM host and guests


RHEL4 kernel selection


x86
Standard kernel (no PAE, 3G/1G)
  UP systems with <= 4GB RAM
SMP kernel (PAE, 3G/1G)
  SMP systems with < ~16GB RAM
  Highmem/Lowmem ratio <= 16:1
Hugemem kernel (PAE, 4G/4G)
  SMP systems > ~16GB RAM

x86_64
Standard kernel for UP systems
SMP kernel for systems with up to 8 CPUs
LargeSMP kernel for systems up to 512 CPUs


RHEL5 kernel selection


x86
Standard kernel (no PAE, 3G/1G)
  UP and SMP systems with <= 4GB RAM
PAE kernel (PAE, 3G/1G)
  UP and SMP systems with >4GB RAM

x86_64
Standard kernel for all systems

IA64
Standard kernel for all systems


VM: swappiness
Controls how aggressively the system reclaims mapped memory:
  Anonymous memory - swapping
  Mapped file pages - writing if dirty and freeing
  System V shared memory - swapping
Decreasing: more aggressive reclaiming of unmapped pagecache memory
Increasing: more aggressive swapping of mapped memory


/proc/sys/vm/swappiness
Sybase server with /proc/sys/vm/swappiness set to 60 (default)

procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
51643644267883544323417888801204044749613022084625342516

Sybase server with /proc/sys/vm/swappiness set to 10

procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
8302422867243228069600238886377612862002024381326


/proc/sys/vm/min_free_kbytes
Directly controls the page reclaim watermarks in KB
# echo 1024 > /proc/sys/vm/min_free_kbytes
-----------------------------------------------------------
Node 0 DMA free:4420kB min:8kB low:8kB high:12kB
Node 0 DMA32 free:14456kB min:1012kB low:1264kB high:1516kB
-----------------------------------------------------------
# echo 2048 > /proc/sys/vm/min_free_kbytes
-----------------------------------------------------------
Node 0 DMA free:4420kB min:20kB low:24kB high:28kB
Node 0 DMA32 free:14456kB min:2024kB low:2528kB high:3036kB
-----------------------------------------------------------
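
The low and high values in this output follow from each zone's min value. A sketch of the arithmetic, assuming 4 kB pages and the rule used by 2.6.18-era kernels (low = min + min/4, high = min + min/2, computed in whole pages); the function name zone_watermarks_kb is hypothetical.

```shell
#!/bin/sh
# Derive a zone's low/high watermarks (kB) from its min watermark (kB).
# Assumption: 4 kB pages and low = min + min/4, high = min + min/2 in pages,
# which reproduces the figures shown for this kernel.
zone_watermarks_kb() {
    page_kb=${2:-4}
    min_pages=$(( $1 / page_kb ))
    low_pages=$(( min_pages + min_pages / 4 ))
    high_pages=$(( min_pages + min_pages / 2 ))
    echo "low:$(( low_pages * page_kb ))kB high:$(( high_pages * page_kb ))kB"
}

zone_watermarks_kb 1012    # prints low:1264kB high:1516kB
zone_watermarks_kb 2024    # prints low:2528kB high:3036kB
```

The integer division in pages is what makes the small DMA zone show low equal to min (8kB) in the first listing.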


Memory reclaim Watermarks - min_free_kbytes


Free List
All of RAM
  Do nothing
Pages High - kswapd sleeps above High
  kswapd reclaims memory
Pages Low - kswapd wakes up at Low
  kswapd reclaims memory
Pages Min - all memory allocators reclaim at Min
  user processes/kswapd reclaim memory
0


/proc/sys/vm/dirty_ratio
Absolute limit to percentage of dirty pagecache memory
Default is 40%
Lower means less dirty pagecache and smaller IO streams
Higher means more dirty pagecache and larger IO streams


/proc/sys/vm/dirty_background_ratio
Controls when dirty pagecache memory starts getting written.
Default is 10%
Lower
  pdflush starts earlier
  less dirty pagecache and smaller IO streams
Higher
  pdflush starts later
  more dirty pagecache and larger IO streams


dirty_ratio and dirty_background_ratio


pagecache
100% of pagecache RAM dirty
  pdflushd and write()'ng processes write dirty buffers
dirty_ratio (40% of RAM dirty) - processes start synchronous writes
  pdflushd writes dirty buffers in background
dirty_background_ratio (10% of RAM dirty) - wakeup pdflushd
  do_nothing
0% of pagecache RAM dirty
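
As a rough worked example of the two thresholds, the sketch below turns the default ratios into kB of dirty pagecache. The memory size is an illustrative figure, not a measurement tied to this slide, and the function name dirty_threshold_kb is hypothetical. The kernel may apply the ratios to a smaller base than total RAM, so the results are best read as upper-bound estimates.

```shell
#!/bin/sh
# kB of dirty pagecache at which a percentage threshold is reached.
# $1 = memory in kB, $2 = ratio in percent. Hypothetical helper.
dirty_threshold_kb() {
    echo $(( $1 * $2 / 100 ))
}

mem_kb=16301368   # illustrative MemTotal for a 16GB system (assumption)
echo "pdflush wakes near $(dirty_threshold_kb "$mem_kb" 10) kB dirty"
echo "synchronous writes start near $(dirty_threshold_kb "$mem_kb" 40) kB dirty"
```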


/proc/sys/vm/pagecache
Controls when pagecache memory is deactivated.
Default is 100%
Lower
  Prevents swapping out anonymous memory
Higher
  Favors pagecache pages
  Disabled at 100%


Pagecache Tuning
[Diagram: page lists ACTIVE, INACTIVE (new -> old) and FREE; transitions labeled Filesystem/pagecache Allocation, Accessed (pagecache under limit), Aging, Accessed (pagecache over limit), reclaim]

(Hint)flushing the pagecache


echo 1 > /proc/sys/vm/drop_caches
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa
00224571841078083350196000561136212008317
0022457184107808335019600001039198001000
0022457184107808335019600001021188001000
0022457184107808335019600001035204001000
0022457248107808335019600001008164001000
302242128160176143863600001030197015850
002243610656204344080028361027177032672
0022436106562043440800001026180001000
002243610720212344000080101018300991


(Hint)flushing the slabcache


echo 2 > /proc/sys/vm/drop_caches
Before:
[tmp]# cat /proc/meminfo
MemTotal: 3907444 kB
MemFree: 3104576 kB
Slab: 415420 kB
Hugepagesize: 2048 kB
After:
[tmp]# cat /proc/meminfo
MemTotal: 3907444 kB
MemFree: 3301788 kB
Slab: 218208 kB
Hugepagesize: 2048 kB

RHEL5 CPUspeed and performance:


Enabled = governor set to ondemand
Looks at cpu usage to regulate power
Within 3-5% of performance for cpu loads
IO loads can keep cpu stepped down, -15-30%
Supported in RHEL5 virtualization
To turn off (set the governor first, else cpus may be left in a reduced step):
If it's not using performance, then:
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Then check to see if it stuck:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Check /proc/cpuinfo to make sure you're seeing the expected CPU freq.
Proceed to normal service disable
# service cpuspeed stop
# chkconfig cpuspeed off
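
The commands above touch cpu0 only. A minimal sketch that applies the same write to every CPU exposing cpufreq follows; the function names set_governor and show_governors are hypothetical, and the sysfs root is a parameter only so the functions can be tried against a scratch directory.

```shell
#!/bin/sh
# Write one governor to every CPU that exposes cpufreq.
# $1 = governor name, $2 = sysfs root (defaults to the real one).
set_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -w "$f" ] || continue     # skip CPUs without cpufreq, or an unmatched glob
        echo "$gov" > "$f"
    done
}

# Print "cpuN governor" per CPU so the change can be checked.
show_governors() {
    root=${1:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        cpu=${f%/cpufreq/scaling_governor}
        echo "${cpu##*/} $(cat "$f")"
    done
}
```

Run as root, `set_governor performance` followed by `show_governors` would correspond to the echo and cat steps above, repeated for each CPU.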


CPU Scheduler
Recognizes differences between logical and physical processors
  I.E. Multi-core, hyperthreaded & chips/sockets
Optimizes process scheduling to take advantage of shared on-chip cache, and NUMA memory nodes
Implements multilevel run queues for sockets and cores (as opposed to one run queue per processor or per system)
  Strong CPU affinity avoids task bouncing
Requires system BIOS to report CPU topology correctly
[Diagram: Scheduler Compute Queues - processes queued per thread, core and socket (Socket 0 with Core 0 and Core 1, two threads each; Socket 1; Socket 2)]



NUMA related Tuning

Numastat
Numactl
Hugetlbfs
/sys/devices/system/node


RHEL4&5 NUMAstat and NUMActl


EXAMPLES
numactl --interleave=all bigdatabase arguments
  Run big database with its memory interleaved on all CPUs.
numactl --cpubind=0 --membind=0,1 process
  Run process on node 0 with memory allocated on node 0 and 1.
numactl --preferred=1 numactl --show
  Set preferred node 1 and show the resulting state.
numactl --interleave=all --shmkeyfile /tmp/shmkey
  Interleave all of the sysv shared memory region specified by /tmp/shmkey over all nodes.
numactl --offset=1G --length=1G --membind=1 --file /dev/shm/A --touch
  Bind the second gigabyte in the tmpfs file /dev/shm/A to node 1.
numactl --localalloc /dev/shm/file
  Reset the policy for the shared memory file file to the default localalloc policy.


RHEL4&5 NUMAstat and NUMActl


NUMAstat to display system NUMA characteristics on a NUMA system
[root@perf5 ~]# numastat
                 node3    node2    node1    node0
numa_hit         72684    82215   157244   325444
numa_miss            0        0        0        0
numa_foreign         0        0        0        0
interleave_hit    2668     2431     2763     2699
local_node       67306    77456   152115   324733
other_node        5378     4759     5129      711
NUMActl to control process and memory
numactl [ --interleave nodes ] [ --preferred node ] [ --membind nodes ]
  [ --cpubind nodes ] [ --localalloc ] command {arguments ...}
TIP
App < memory of a single NUMA zone
  numactl: use cpubind to cpus within the same socket
App > memory of a single NUMA zone
  numactl: interleave XY and cpubind XY


Linux NUMA Evolution (NEWer)
[Chart: RHEL3, 4 and 5 Linpack Multi-stream, AMD64, 8 cpu dual-core (1/2 cpus loaded); y-axis: Performance in Kflops; series: Default Scheduler, Taskset Affinity; categories: RHEL3-U8, RHEL4-U5, RHEL5-GOLD]

Limitations:
Numa spill to different numa boundaries
Process migrations - no way back
Lack of page replication - text, read mostly

HugeTLBFS
The Translation Lookaside Buffer (TLB) is a small CPU cache of recently used virtual to physical address mappings
TLB misses are extremely expensive on today's very fast, pipelined CPUs
Large memory applications can incur high TLB miss rates
HugeTLBs permit memory to be managed in very large segments
Example: x86_64
  Standard page: 4KB
  Huge page: 2MB
  512:1 difference
File system mapping interface
Ideal for databases
Example: 128 entry TLB can fully map 256MB
[Diagram: TLB (128 data, 128 instruction entries) mapping Virtual Address Space to Physical Memory]
* RHEL6 1GB hugepage support
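
The 256MB figure is the TLB entry count multiplied by the page size. A small sketch of that arithmetic for both page sizes named on this slide; the function name tlb_reach_kb is hypothetical.

```shell
#!/bin/sh
# Memory covered by a fully populated TLB, in kB.
# $1 = number of TLB entries, $2 = page size in kB. Hypothetical helper.
tlb_reach_kb() {
    echo $(( $1 * $2 ))
}

echo "4KB pages: $(tlb_reach_kb 128 4) kB"       # prints 4KB pages: 512 kB
echo "2MB pages: $(tlb_reach_kb 128 2048) kB"    # prints 2MB pages: 262144 kB
```

262144 kB is 256MB, so the same 128 entries cover 512 times more memory with huge pages.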

Hugepages - before
$ vmstat
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa st
0001562365631044401120001871416375109720

$ cat /proc/meminfo
MemTotal: 16301368 kB
MemFree: 15623604 kB
...
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB


Hugepages - reserving
$ echo 2000 > /proc/sys/vm/nr_hugepages
$ vmstat
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa st
0001152663231168401780001291015663109810

$ cat /proc/meminfo
MemTotal: 16301368 kB
MemFree: 11526520 kB
...
HugePages_Total: 2000
HugePages_Free: 2000
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
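
The value written to nr_hugepages is a page count, not a size. A sketch of the conversion from a target size in MB, rounding up; the function name hugepages_for_mb is hypothetical and the default of 2048 kB simply mirrors the Hugepagesize line above.

```shell
#!/bin/sh
# Number of huge pages needed to back a region of $1 megabytes, rounded up.
# $2 = huge page size in kB (defaults to 2048, as reported by Hugepagesize).
hugepages_for_mb() {
    size_kb=$(( $1 * 1024 ))
    page_kb=${2:-2048}
    echo $(( (size_kb + page_kb - 1) / page_kb ))
}

hugepages_for_mb 1024    # prints 512
hugepages_for_mb 4000    # prints 2000
```

Reserving 2000 pages of 2048 kB accounts for 4096000 kB, which is close to the drop in MemFree between the before and after listings.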


Hugepages - using
$ mount -t hugetlbfs hugetlbfs /huge
$ cp 1GBfile /huge/junk
$ vmstat
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa st
00010526632311681401780001291015663109810
$ cat /proc/meminfo
LowTotal: 16301368 kB
LowFree: 11524756 kB
...
HugePages_Total: 2000
HugePages_Free: 1488
HugePages_Rsvd: 0
Hugepagesize: 2048 kB


Hugepages - releasing
$ rm /huge/junk
$ cat /proc/meminfo
MemTotal: 16301368 kB
MemFree: 11524776 kB
...
HugePages_Total: 2000
HugePages_Free: 2000
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
$ echo 0 > /proc/sys/vm/nr_hugepages
$ vmstat
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy id wa st
00015620488315124019440071614959109810
$ cat /proc/meminfo
MemTotal: 16301368 kB
MemFree: 15620500 kB
...
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
Hugepagesize: 2048 kB

NUMA Hugepages - reserving
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total:

Node 0 HugePages_Free:

Node 1 HugePages_Total:

Node 1 HugePages_Free:

[root@dhcp-100-19-50 ~]# echo 6000 > /proc/sys/vm/nr_hugepages


[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2980
Node 0 HugePages_Free: 2980
Node 1 HugePages_Total: 3020
Node 1 HugePages_Free: 3020


NUMA Hugepages - using
[root@dhcp-100-19-50 ~]# mount -t hugetlbfs hugetlbfs /huge
[root@dhcp-100-19-50 ~]# /usr/tmp/mmapwrite /huge/junk 32 &
[1] 18804
[root@dhcp-100-19-50 ~]# Writing 1048576 pages of random junk to file /huge/junk
wrote 4294967296 bytes to file /huge/junk
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2980
Node 0 HugePages_Free: 2980
Node 1 HugePages_Total: 3020
Node 1 HugePages_Free: 972


NUMA Hugepages - using (overcommit)
[root@dhcp-100-19-50 ~]# /usr/tmp/mmapwrite /huge/junk 33 &
[1] 18815
[root@dhcp-100-19-50 ~]# Writing 2097152 pages of random junk to file /huge/junk
wrote 8589934592 bytes to file /huge/junk
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2980
Node 0 HugePages_Free: 1904
Node 1 HugePages_Total: 3020
Node 1 HugePages_Free: 0


NUMA Hugepages - reducing
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2980
Node 0 HugePages_Free: 2980
Node 1 HugePages_Total: 3020
Node 1 HugePages_Free: 3020
[root@dhcp-100-19-50 ~]# echo 3000 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 0
Node 0 HugePages_Free: 0
Node 1 HugePages_Total: 3000
Node 1 HugePages_Free: 3000


NUMA Hugepages - freeing/reserving
[root@dhcp-100-19-50 ~]# echo 6000 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 2982
Node 0 HugePages_Free: 2982
Node 1 HugePages_Total: 3018
Node 1 HugePages_Free: 3018
[root@dhcp-100-19-50 ~]# echo 0 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# echo 3000 > /proc/sys/vm/nr_hugepages
[root@dhcp-100-19-50 ~]# cat /sys/devices/system/node/*/meminfo | grep Huge
Node 0 HugePages_Total: 1500
Node 0 HugePages_Free: 1500
Node 1 HugePages_Total: 1500
Node 1 HugePages_Free: 1500


JVM Tuning
Eliminate swapping
  Lower swappiness to 10% (or lower if necessary).
Promote pagecache reclaiming
  Lower dirty_background_ratio to 10%
  Lower dirty_ratio if necessary
Promote inode cache reclaiming
  Raise vfs_cache_pressure (higher values make the kernel reclaim dentry and inode caches more readily)


Section 4: Tuning Examples


JVM Tuning
SPECjbb
File systems
EXT3, GFS, NFS, EXT4, XFS
Database Performance
Scaling, Hugepages, AIO/DIO, Elevators, SELinux
Network Tuning


RHEL5.4 KVM Java OpenJDK Performance


Intel Nehalem 2.4 GHz, 24 GB mem
[Chart: series Base, Base HugePages, % Virt Huge KVM; x-axis: #cpus]

Java Performance SPECjbb2005 Benchmark on RHEL (contd.)

In February 2009 RHEL running on a 96-core Xeon based server from NEC achieved the best score recorded on an x86_64 server (= 2,150,260 BOPS).
Use of NUMActl and HugePages critical to result
See Reference Architecture talk on Benchmark Papers/Results


General Performance Tuning Guidelines
Use hugepages whenever possible.
Minimize swapping.
Maximize pagecache reclaiming
Place swap partition(s) on quiet device(s).
Direct IO if possible.
Beware of turning NUMA off.


Understanding IOzone Results


GeoMean per category are statistically meaningful.
Understand HW setup
  Disk, RAID, HBA, PCI
Layout file systems
  LVM or MD devices
  Partitions w/ fdisk
Baseline raw IO DD/DT
EXT3 perf w/ IOzone
  In-cache: file sizes which fit, goal -> 90% memory BW.
  Out-of-cache: file sizes more than 2x memory size
  O_DIRECT: 95% of raw
Global File System GFS goal --> 90-95% of local EXT3

Use raw command
  fdisk /dev/sdX
  raw /dev/raw/rawX /dev/sdX1
  dd if=/dev/raw/rawX bs=64k
Mount file system
  mkfs -t ext3 /dev/sdX1
  mount -t ext3 /dev/sdX1 /perf1
IOzone commands
  iozone -a -f /perf1/t1 (incache)
  iozone -a -I -f /perf1/t1 (w/ dio)
  iozone -s 2xmem -f /perf1/t1 (big)


EXT3, GFS, NFS Iozone in cache


[Chart: RHEL5 In-Cache IOzone EXT3, GFS1, GFS2 (GeoM 1M-4GB, 1k-1m); y-axis: Performance MB/sec; series: EXT_inCache, GFS1-InCache, NFS-InCache; categories: ALL I/O's, Initial Write, Re-Write, Read, Re-Read, Random Read, Random Write, Backward Read, RecRe-Write, Stride Read, Fwrite, Fre-Write, Fread, Fre-Read]

Using IOzone w/ O_DIRECT to mimic a database


Problem :
Filesystems use memory for file cache
Databases use memory for database cache
Users want filesystem for management outside database
access (copy, backup etc)
You DON'T want BOTH to cache.
Solution :
Filesystems that support Direct IO
Open files with the O_DIRECT option
Databases which support Direct IO (ORACLE)
NO DOUBLE CACHING!


EXT3, GFS, NFS Iozone w/ DirectIO


[Chart: RHEL5 Direct_IO IOzone EXT3, GFS, NFS (GeoM 1M-4GB, 1k-1m); y-axis: Performance in MB/sec; series: EXT_DIO, GFS1_DIO, NFS_DIO; categories: ALL I/O's, Initial Write, Re-Write, Read, Re-Read, Random Read, Random Write, Backward Read, RecRe-Write, Stride Read]

RHEL5.3 IOzone EXT3, EXT4, XFS eval


[Chart: RHEL5.3 (120), IOzone Performance, GeoMean 1k points, Intel 8-cpu, 16GB, FC; y-axis: Percent Relative to EXT3; series: In Cache, Direct I/O, > Cache; categories: EXT4DEV, EXT4 BARRIER=0, XFS, XFS Barrier=0]

RHEL5 Oracle 10.2 Performance Filesystems


Intel 8-cpu, 16GB, 2 FC MPIO, AIO/DIO
[Chart: series RHEL53 Base 8cpus ext3, RHEL53 Base 8cpus xfs, RHEL53 Base 8cpus ext4; x-axis: 10U, 20U, 40U, 60U, 80U, 100U]

Large App and Database Performance


Scaling 1-24 core single servers
Huge Pages
2MB huge pages
Set value in /etc/sysctl.conf (vm.nr_hugepages)

NUMA
Localized memory access for certain workloads improves performance


Oracle OLTP Performance Scaling at 16-cpu

[Chart: RHEL5.4 Oracle OLTP Performance; y-axis: OLTP (tpm); series: Tigerton 2.93 GHz, 32 GB mem; Nehalem 2.687 GHz, 36 GB mem; x-axis: #CPUs (RHEL52 Base 4 CPU, 8 CPU, 16 CPU)]

Oracle OLTP Performance Scaling at 24-cpu


[Chart: Oracle Scaling on Intel Dunnington; series: RHEL53 Base 6 CPU, RHEL53 Base 12 CPU, RHEL53 Base 24 CPU; x-axis: 100U]

RHEL5.2 Oracle 10.2 Hugepages
[Chart: Relative performance; series: 2.6.18-90.el5, 2.6.18-90.el5 Huge Pages, % Difference; x-axis: 40U, 60U, 80U, 100U]


Asynchronous I/O to File Systems
Allows application to continue processing while I/O is in progress
  Eliminates Synchronous I/O stall
  Critical for I/O intensive server applications
Red Hat Enterprise Linux since 2002
  Support for RAW devices only
With Red Hat Enterprise Linux 4, significant improvement:
  Support for Ext3, NFS, GFS file system access
  Supports Direct I/O (e.g. Database applications)
  Makes benchmark results more appropriate for real-world comparisons
[Diagram: Synchronous I/O - App I/O Request, Device Driver I/O Request Issue, I/O, I/O Request Completion; the application stalls for completion. Asynchronous I/O - App I/O Request, Device Driver I/O Request Issue, I/O Completion, I/O Request Completion; no stall for completion.]

RHEL5.1 Oracle 10.2 - I/O options


[Chart: RHEL5.1 with Oracle 10.2 I/O Options; series: 100U; categories: AIO + DIO, DIO only, AIO only, No AIO or DIO]

Disk IO tuning - RHEL4/5


RHEL4/5 - 4 tunable I/O Schedulers
CFQ - elevator=cfq. Completely Fair Queuing - default, balanced, fair for multiple luns, adaptors, smp servers
NOOP - elevator=noop. No-operation in kernel, simple, low cpu overhead, leave opt to ramdisk, raid cntrl etc.
Deadline - elevator=deadline. Optimize for run-time-like behavior, low latency per IO, balance issues with large IO luns/controllers (NOTE: current best for FC5)
Anticipatory - elevator=as. Inserts delays to help stack aggregate IO, best on system w/ limited physical IO - SATA
RHEL4 - Set at boot time on command line
RHEL5 - Change on the fly
echo deadline > /sys/block/<sdx>/queue/scheduler
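
Reading the same scheduler file back lists the available elevators with the active one in square brackets, for example "noop anticipatory deadline [cfq]". A small sketch that pulls out the active name so a change can be confirmed; the function name active_scheduler is hypothetical.

```shell
#!/bin/sh
# Extract the active elevator (the bracketed name) from a scheduler line.
# Hypothetical helper; prints nothing if no bracketed name is present.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_scheduler "noop anticipatory deadline [cfq]"    # prints cfq
```

On a live system the argument would come from the file itself, e.g. `active_scheduler "$(cat /sys/block/<sdx>/queue/scheduler)"` with the device name filled in.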

RHEL5 IO schedulers vs RHEL3 for Database
[Chart: Oracle 10G oltp/dss (relative performance); metrics: % tran/min and % queries/hour; bar labels: CFQ 100.0% / 100.0%, Deadline 87.2% / 108.9%, RHEL3 84.1% / 84.8%, Noop 77.7% / 75.9%, AS 28.4% / 23.2%]


Oracle 10g Performance Scaling on RHEL5

Oracle OLTP performance on RHEL scales well to 24 cores.
See Reference Architecture talk on Benchmark Papers/Results
Testing on larger servers with the most recent x86_64 technology is anticipated in the coming year.

Oracle 10g: RHEL5.4 KVM Virtualization Efficiency


[Chart: Virtualization Efficiency: Consolidation, Oracle OLTP Load - 160 Total Users; y-axis: Transactions / Minute; x-axis: Configuration (Guests x vCPUs): Bare Metal, 2 Guests x 8 vCPUs, 4 Guests x 4 vCPUs, 8 Guests x 2 vCPUs]


SELinux w/ Oracle OLTP Performance


(steady state performance <5%)
[Chart: RHEL5.2 Oracle 10G R4 w/ SELinux, Intel 8-cpu, 16GB, 2 FC, OLTP Perf; y-axis: OLTP Trans/Min; series: RHEL53 Base 8cpus, RHEL53 Base 8cpus SELinux Enabled, RHEL53 Base 8cpus SELinux Permissive; x-axis: Simulated Users (x100): 10U, 20U, 40U, 60U, 80U, 100U]

Benchmark Tuning
Use hugepages.
Don't overcommit memory.
If memory must be overcommitted:
Eliminate all swapping.
Maximize pagecache reclaiming.
Place swap partition(s) on separate device(s).
Use Direct IO.
Don't turn NUMA off.
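As a rough sketch of the first point, the number of huge pages to reserve can be estimated from the size of the region to be backed (for example a database shared memory segment). The helper name hugepages_needed and the 8 GB / 2 MB figures are illustrative assumptions, not values from the talk.

```shell
#!/bin/sh
# Number of huge pages needed to back a region, rounded up.
# $1 = region size in MB, $2 = huge page size in kB
hugepages_needed() {
    echo $(( ($1 * 1024 + $2 - 1) / $2 ))
}

# Illustrative: an 8 GB region with 2 MB (2048 kB) huge pages
hugepages_needed 8192 2048   # prints 4096

# On a live system (as root), the page size and reservation would be:
#   grep Hugepagesize /proc/meminfo
#   sysctl -w vm.nr_hugepages=4096
```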


Network Performance Tuning Outline


IRQ Affinity / Processor Affinity: no magic formula, experiment to get the best results
Interrupt coalescing
Chip architectures play a big role
Try to match TX and RX on the same socket / data caches
sysctl.conf
Increase/decrease memory parameters for the network
Driver settings
NAPI, if the driver supports it
HW ring buffers
TSO, UFO, GSO

sysctl

sysctl is a mechanism to view and control the entries under the /proc/sys tree
sysctl -a - lists all variables
sysctl <variable> - queries a variable
sysctl -w - writes a variable
When setting values, spaces are not allowed
sysctl -w net.ipv4.conf.lo.arp_filter=0
Look at documentation in /usr/src
/usr/src/redhat/BUILD/kernel-2.6.18/linux-2.6.18.x86_64/Documentation/networking
By default, Linux networking is tuned more for reliability than for max performance
Buffers are especially not tuned for local 10GbE traffic
Remember that Linux autotunes buffers for connections

Some Important settings for sysctl


net.ipv4.tcp_window_scaling - toggles window scaling
Misc TCP protocol
net.ipv4.tcp_timestamps - toggles TCP timestamp support
net.ipv4.tcp_sack - toggles SACK (Selective ACK) support
TCP Memory Allocations
net.ipv4.tcp_rmem - TCP read buffer (min/default/max) - in bytes
overridden by net.core.rmem_max
net.ipv4.tcp_wmem - TCP write buffer (min/default/max) - in bytes
overridden by net.core.wmem_max
net.ipv4.tcp_mem - TCP buffer space (min/pressure/max)
measured in pages, not bytes!
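Because net.ipv4.tcp_mem is expressed in pages while the other TCP buffer settings are in bytes, it can help to convert before comparing them. A small sketch follows; the helper name pages_to_bytes and the sample value 196608 are illustrative assumptions.

```shell
#!/bin/sh
# Convert a page count (the unit used by net.ipv4.tcp_mem) to bytes.
# $1 = pages, $2 = page size in bytes (defaults to this machine's page size)
pages_to_bytes() {
    page_size=${2:-$(getconf PAGESIZE)}
    echo $(( $1 * page_size ))
}

# Illustrative: 196608 pages of 4096 bytes is 768 MB
pages_to_bytes 196608 4096   # prints 805306368

# On a live system, the three tcp_mem fields could be read with:
#   sysctl net.ipv4.tcp_mem
```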

Some Important settings for sysctl


CORE memory settings
net.core.rmem_max - max size of rx socket buffer
net.core.wmem_max - max size of tx socket buffer
net.core.rmem_default - default rx size of socket buffer
net.core.wmem_default - default tx size of socket buffer
net.core.optmem_max - maximum amount of option memory buffers
net.core.netdev_max_backlog - how many unprocessed rx packets are queued
before the kernel starts to drop them
These settings also impact UDP
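One common starting point for sizing the maximum socket buffers is the bandwidth-delay product of the link. This heuristic is not taken from the slides; the sketch below, with the hypothetical helper bdp_bytes and an assumed 1 ms round-trip time, only shows the arithmetic.

```shell
#!/bin/sh
# Bandwidth-delay product in bytes.
# $1 = link speed in Mbit/s, $2 = round-trip time in ms
# bytes = (Mbit/s * 10^6 / 8) * (ms / 1000) = Mbit/s * ms * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# Illustrative: 10GbE with an assumed 1 ms round-trip time
bdp_bytes 10000 1   # prints 1250000

# A value in this range could then be tried (as root) and measured, e.g.:
#   sysctl -w net.core.rmem_max=1250000
#   sysctl -w net.core.wmem_max=1250000
```

Treat the result as a first guess to test with a benchmark rather than a recommended setting.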


Control your network route:


Check arp_filter settings with sysctl
sysctl -a | grep arp_filter
A setting of 0 means any path may be used
If there is more than one path between machines, set arp_filter=1
Look for increasing interrupt counts in /proc/interrupts or
increasing counters via ifconfig or netstat
[Diagram: lab switch with 1GbE and 10GbE links]


netperf
http://netperf.org
Feature Rich
Read documentation

Default test is TCP_STREAM, which uses the send() call
TCP_SENDFILE uses the sendfile() call, with much less copying
TCP_RR: request / response tests
UDP_STREAM
Many others


General Network/Disk I/O Tuning Guidelines


To maximize network throughput:
Disable irqbalance
service irqbalance stop
chkconfig irqbalance off
Disable cpuspeed
default gov=ondemand, set governor to performance
Use affinity to maximize multi-core shared cache environments
Process affinity
Use taskset or MRG's Tuna
Interrupt affinity
grep eth2 /proc/interrupts
echo 80 > /proc/irq/177/smp_affinity
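The value written to smp_affinity is a hexadecimal CPU bitmask, so the 80 above selects CPU 7 (bit 7 set). A minimal sketch for deriving the mask of a single CPU; the helper name cpu_mask is illustrative, and the sketch assumes a CPU number below 32.

```shell
#!/bin/sh
# Hex smp_affinity mask selecting a single CPU (CPU numbers start at 0).
# Assumption: CPU number below 32, so the mask fits in one 32-bit word.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

cpu_mask 7   # prints 80, the value used for IRQ 177 above
cpu_mask 0   # prints 1

# Pinning a process to the same CPU with taskset (placeholder command):
#   taskset -c 7 <command>
```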


Linux Performance Tuning References


Alikins, System Tuning Info for Linux Servers,
http://people.redhat.com/alikins/system_tuning.html
Axboe, J., Deadline IO Scheduler Tunables, SuSE, EDF R&D, 2003.
Braswell, B., Ciliendo, E., Tuning Red Hat Enterprise Linux on IBM eServer xSeries
Servers, http://www.ibm.com/redbooks
Corbet, J., The Continuing Development of IO Scheduling,
http://lwn.net/Articles/21274.
Ezolt, P., Optimizing Linux Performance, www.hp.com/hpbooks, Mar 2005.
Heger, D., Pratt, S., Workload Dependent Performance Evaluation of the Linux 2.6
IO Schedulers, Linux Symposium, Ottawa, Canada, July 2004.
Red Hat Enterprise Linux Performance Tuning Guide,
http://people.redhat.com/dshaks/rhel3_perf_tuning.pdf
Network, NFS Performance covered in separate talks,
http://nfs.sourceforge.net/nfshowto/performance.html


Questions?
