A quick look at Object Oriented (OO) programming
A common example
Optimisation of that example
Summary
What's OOP for? OO programming allows you to think about problems in terms of objects and their interactions. Each object is (ideally) self-contained:
- Contains its own code and data.
- Defines an interface to its code and data.
Are objects good? Well, yes... and no. First, some history.
A brief C++ timeline, 1979-2009:
- 1979: work begins on C with Classes
- 1983: named C++
- 1985: first commercial release
- 1989: release of v2.0
- 1998: standardised (ISO C++98)
- 2003: updated (C++03)
- 2009: C++0x on the way
CPU performance
[Graph: CPU performance over time. Source: Computer Architecture: A Quantitative Approach, by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau]
CPU/Memory performance
[Graph: CPU vs. memory performance over time, same source]
What has changed since 1979? One of the biggest changes is that memory access has become far slower relative to the CPU:
- 1980: RAM latency ~ 1 cycle
- 2009: RAM latency ~ 400+ cycles
OO classes encapsulate code and data. So, an instantiated object will generally contain all data associated with it.
My Claim
The example scene tree:
- Node: container class
- Modifier: updates transforms
- Drawable/Cube: renders objects
Object
Each object:
- Maintains a bounding sphere for culling
- Has a transform (local and world)
- Dirty flag (optimisation)
- Pointer to parent
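As a rough illustration of the data each object carries (the names and exact layout here are invented, not taken from the original slides):

```cpp
// Illustrative sketch of the per-object data listed above. Each matrix is
// 16 floats (64 bytes), matching the "each square is 4 bytes" diagrams.
struct Matrix4 { float m[16]; };            // local or world transform
struct Sphere  { float x, y, z, radius; };  // bounding sphere for culling

struct Object {
    Matrix4 m_LocalTransform {};   // local transform
    Matrix4 m_WorldTransform {};   // world transform, derived from the parent's
    Sphere  m_WorldBS {};          // world-space bounding sphere
    bool    m_Dirty   = true;      // the "optimisation" flag discussed later
    Object* m_Parent  = nullptr;   // pointer to parent
};
```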
Objects
[Class definition and memory layout diagram: each square is 4 bytes]
Nodes
[Class definition and memory layout diagram]
Update the world transform and world space bounding sphere for each object.
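The update above can be sketched roughly as follows (a sketch only: names are invented, and a pure translation stands in for the full 4x4 transform so the traversal logic stays visible):

```cpp
#include <vector>

// Sketch of the recursive per-object update: derive the world transform from
// the parent, move the bounding sphere into world space, recurse.
struct Vec3   { float x, y, z; };
struct Sphere { Vec3 centre; float radius; };

struct Node {
    Vec3   m_Local   {};            // local transform (translation only here)
    Vec3   m_World   {};            // world transform, derived on update
    Sphere m_LocalBS {};            // object-space bounding sphere
    Sphere m_WorldBS {};            // world-space bounding sphere for culling
    Node*  m_Parent = nullptr;
    std::vector<Node*> m_Children;

    void Update() {
        // World transform = parent's world transform combined with our local one.
        Vec3 p = m_Parent ? m_Parent->m_World : Vec3{};
        m_World = { p.x + m_Local.x, p.y + m_Local.y, p.z + m_Local.z };
        // World-space bounding sphere = local sphere moved into world space.
        m_WorldBS.centre = { m_World.x + m_LocalBS.centre.x,
                             m_World.y + m_LocalBS.centre.y,
                             m_World.z + m_LocalBS.centre.z };
        m_WorldBS.radius = m_LocalBS.radius;
        for (Node* c : m_Children) c->Update();   // depth-first into children
    }
};
```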
Leaf nodes (objects) return transformed bounding spheres. What's wrong with this code?
If m_Dirty == false, then we get a branch misprediction, which costs 23 or 24 cycles.
Calculation of the world bounding sphere takes only 12 cycles.
So using a dirty flag here is actually slower than not using one (in the case where it is false).
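The point can be illustrated with two equivalent accessors (a sketch with invented names and a translation-only transform; the cycle counts above are from the talk's target hardware, not something this sketch measures):

```cpp
// Version A branches on the dirty flag; version B just recomputes. When the
// recompute costs ~12 cycles and a mispredicted branch costs ~23-24, the
// "lazy" version can lose even though it sometimes skips the work.
struct Sphere { float x, y, z, r; };

struct Obj {
    Sphere m_LocalBS {1, 2, 3, 4};
    Sphere m_WorldBS {};
    float  m_Pos[3]  {10, 0, 0};   // translation-only stand-in for the transform
    bool   m_Dirty = true;

    void TransformBS() {           // the ~12 cycles of real work
        m_WorldBS = { m_LocalBS.x + m_Pos[0], m_LocalBS.y + m_Pos[1],
                      m_LocalBS.z + m_Pos[2], m_LocalBS.r };
    }
    const Sphere& GetWorldBS_Lazy() {      // version A: dirty flag
        if (m_Dirty) { TransformBS(); m_Dirty = false; }
        return m_WorldBS;
    }
    const Sphere& GetWorldBS_Eager() {     // version B: always recompute
        TransformBS();
        return m_WorldBS;
    }
};
```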
Cache usage
[Animation: stepping through the traversal, each node access pulls data from main memory into the L2 cache. parentTransform is already in the cache (somewhere). The new code checks the dirty flag, then sets the world bounding sphere.]
The Test
- 11,111 nodes/objects in a tree, 5 levels deep
- Every node being transformed
- Hierarchical culling of the tree
- Render method is empty
Performance
This is the time taken just to traverse the tree!
Why is it so slow? (~22ms)
Look at GetWorldBoundingSphere()
[Profiler close-up: the if(!m_Dirty) comparison]
How can we fix it, and still keep the same functionality and interface?
Allocate contiguously:
- Nodes
- Matrices
- Bounding spheres
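One way to realise this (a sketch with invented names): pool each kind of data in its own contiguous array, and let nodes refer to their transform and sphere by index rather than owning them:

```cpp
#include <vector>

// All matrices, bounding spheres and nodes live in contiguous arrays, so a
// traversal walks memory linearly instead of chasing pointers across the heap.
struct Matrix { float m[16]; };
struct Sphere { float x, y, z, r; };
struct Node   { int transform; int sphere; int parent; };  // indices, not pointers

struct Scene {
    std::vector<Matrix> transforms;   // all transforms, contiguous
    std::vector<Sphere> spheres;      // all bounding spheres, contiguous
    std::vector<Node>   nodes;        // all nodes, contiguous

    int AddNode(int parent) {
        transforms.push_back(Matrix{});
        spheres.push_back(Sphere{});
        nodes.push_back(Node{ (int)transforms.size() - 1,
                              (int)spheres.size() - 1, parent });
        return (int)nodes.size() - 1;
    }
};
```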
Performance
19.6ms -> 12.9ms
35% faster just by moving things around in memory!
What next?
- Process data in order
- Use implicit structure for the hierarchy
- Minimise to-ing and fro-ing between nodes
- Group logic to optimally use what is already in cache
- Remove regularly called virtuals
Hierarchy
Use a set of arrays, one per hierarchy level.
[Diagrams: the tree of nodes flattened level by level; each node annotated only with its child count (e.g. :2, :4). Because children are stored in order, the parent/child structure is implicit in the layout.]
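A sketch of how the implicit layout can be used (names invented): each level is an array, a node stores only its child count, and a running total recovers where each parent's children start in the next level:

```cpp
#include <vector>
#include <cstddef>

// With children stored in parent order, a prefix sum of the child counts
// (the ":2", ":4" annotations) gives each parent's first-child index in the
// next level's array -- no child pointers needed.
struct LevelNode { int childCount; };

std::vector<int> FirstChildIndices(const std::vector<LevelNode>& level) {
    std::vector<int> first(level.size());
    int running = 0;
    for (std::size_t i = 0; i < level.size(); ++i) {
        first[i] = running;              // where this parent's children begin
        running += level[i].childCount;  // skip past them for the next parent
    }
    return first;
}
```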
OO version: update transforms top down and expand world bounding spheres (WBS) bottom up.
Update transform
[Diagrams: the OO traversal steps through the tree node by node.]
Hierarchical bounding spheres pass info up. Transforms cascade down. Data use and code are striped; processing is alternating.
Conversion to linear
[Diagrams: per-level node arrays, annotated with child counts (:4)]
For each node at each level (top down) {
    multiply world transform by parent's
    transform WBS by world transform
}
For each node at each level (bottom up) {
    add WBS to parent's
}
Cull WBS against frustum.
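In C++ the two passes might look like the sketch below (invented names; a single float stands in for both the 4x4 transform and the bounding sphere, so only the per-level traversal pattern is being shown):

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Linear version of the update: one flat array of nodes per level, walked
// top down for transforms and bottom up for bounding spheres.
struct LinNode { float local; float world; float wbs; int parent; };
using Level = std::vector<LinNode>;

// Top down: world transform = parent's world * local; move WBS into world space.
// Level 0's world values are assumed to be set up before the call.
void UpdateTransforms(std::vector<Level>& levels) {
    for (std::size_t l = 1; l < levels.size(); ++l)
        for (LinNode& n : levels[l]) {
            n.world = levels[l - 1][n.parent].world + n.local;
            n.wbs  += n.world;   // stand-in for transforming the sphere
        }
}

// Bottom up: merge each child's world bounding sphere into its parent's
// (std::max stands in for the real sphere union).
void ExpandBoundingSpheres(std::vector<Level>& levels) {
    for (std::size_t l = levels.size(); l-- > 1; )
        for (LinNode& n : levels[l]) {
            float& p = levels[l - 1][n.parent].wbs;
            p = std::max(p, n.wbs);
        }
}
```

Note that each pass is a straight run over contiguous arrays: exactly the access pattern the cache (and a prefetcher) likes.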
Unified L2 Cache
[Animation: parent data and children's data streaming through the unified L2 cache; the children's data is not needed while the parent is being processed.]
Prefetching: because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch it.
Tuner scans show about 1.7 cache misses per node, but these misses are much more frequent: the pattern becomes code / cache miss / cache miss / code, so there is less stalling.
Performance
19.6ms -> 12.9ms -> 4.8ms
Prefetching
Data accesses are now predictable, so we can use prefetch (dcbt) to warm the cache.
Data streams can be tricky: there are many reasons for stream termination, so it is easier to just use dcbt blindly (look ahead x number of iterations).
Prefetching example
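As a sketch of the idea (dcbt is the PPU's data-cache-block-touch instruction; on GCC/Clang the closest portable analogue is __builtin_prefetch; the 4-iteration look-ahead here is illustrative, not a tuned value):

```cpp
// Blind prefetch while walking a contiguous array of nodes: each iteration
// hints that the node a few steps ahead should be pulled into cache.
// Prefetch is only a hint, so a look-ahead that runs past the end is harmless.
struct PNode { float transform[16]; float sphere[4]; };

float SumRadii(const PNode* nodes, int count) {
    const int kLookAhead = 4;   // tune so data arrives about when it is needed
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        __builtin_prefetch(&nodes[i + kLookAhead]);  // dcbt-style warm-up (GCC/Clang)
        sum += nodes[i].sphere[3];                   // use the current node's radius
    }
    return sum;
}
```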
Performance
19.6ms -> 12.9ms -> 4.8ms -> 3.3ms
A Warning on Prefetching
This example makes very heavy use of the cache, which can affect other threads' use of the cache. Multiple threads with heavy cache use may thrash it.
Up close
~16.6ms
Performance counters
- Branch mispredictions: 2,867 (cf. 47,000)
- L2 cache misses: 16,064 (cf. 36,000)
In Summary
- Just reorganising data locations was a win
- Data + code reorganisation = dramatic improvement
- + prefetching = even more WIN
OO is not necessarily EVIL. Be careful not to design yourself into a corner. Consider data in your design:
- Can you decouple data from objects? Code from objects?
Simplify systems:
- KISS
- Easier to optimise, easier to parallelise
Homogeneity
Remember:
- Better performance
- Better realisation of code optimisations
- Often simpler code
- More parallelisable code
The END
Questions?