A quick look at Object Oriented (OO) programming
A common example
Optimisation of that example
Summary
What's OOP for? OO programming allows you to think about problems in terms of objects and their interactions. Each object is (ideally) self-contained:
- Contains its own code and data.
- Defines an interface to its code and data.
Are objects good? Well, yes... and no. First, some history.
A brief C++ timeline, 1979-2009:
- 1979: work begins on C with Classes
- 1983: named C++
- 1985: first commercial release
- 1989: release of v2.0
- 1998: standardised (ISO C++98)
- 2003: updated (C++03)
- 2009: C++0x on the way
CPU performance
[Graph: CPU performance over time. Source: Computer Architecture: A Quantitative Approach, by John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau]
CPU/Memory performance
[Graph: CPU vs. memory performance over time, same source]
What has changed since 1979? One of the biggest changes is that memory access has become far slower relative to the CPU:
- 1980: RAM latency ~ 1 cycle
- 2009: RAM latency ~ 400+ cycles
OO classes encapsulate code and data. So, an instantiated object will generally contain all data associated with it.
My Claim
The example scene tree:
- Node: container class
- Modifier: updates transforms
- Drawable/Cube: renders objects
Object
Each object:
- Maintains a bounding sphere for culling
- Has a transform (local and world)
- Dirty flag (optimisation)
- Pointer to parent
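As a rough illustration of the data each object carries (the names and exact layout here are invented, not taken from the original slides):

```cpp
// Illustrative sketch of the per-object data listed above. Each matrix is
// 16 floats (64 bytes), matching the "each square is 4 bytes" diagrams.
struct Matrix4 { float m[16]; };            // local or world transform
struct Sphere  { float x, y, z, radius; };  // bounding sphere for culling

struct Object {
    Matrix4 m_LocalTransform {};   // local transform
    Matrix4 m_WorldTransform {};   // world transform, derived from the parent's
    Sphere  m_WorldBS {};          // world-space bounding sphere
    bool    m_Dirty   = true;      // the "optimisation" flag discussed later
    Object* m_Parent  = nullptr;   // pointer to parent
};
```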
Objects
[Class definition and memory layout diagram: each square is 4 bytes]
Nodes
[Class definition and memory layout diagram]
Update the world transform and world space bounding sphere for each object.
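The update above can be sketched roughly as follows (a sketch only: names are invented, and a pure translation stands in for the full 4x4 transform so the traversal logic stays visible):

```cpp
#include <vector>

// Sketch of the recursive per-object update: derive the world transform from
// the parent, move the bounding sphere into world space, recurse.
struct Vec3   { float x, y, z; };
struct Sphere { Vec3 centre; float radius; };

struct Node {
    Vec3   m_Local   {};            // local transform (translation only here)
    Vec3   m_World   {};            // world transform, derived on update
    Sphere m_LocalBS {};            // object-space bounding sphere
    Sphere m_WorldBS {};            // world-space bounding sphere for culling
    Node*  m_Parent = nullptr;
    std::vector<Node*> m_Children;

    void Update() {
        // World transform = parent's world transform combined with our local one.
        Vec3 p = m_Parent ? m_Parent->m_World : Vec3{};
        m_World = { p.x + m_Local.x, p.y + m_Local.y, p.z + m_Local.z };
        // World-space bounding sphere = local sphere moved into world space.
        m_WorldBS.centre = { m_World.x + m_LocalBS.centre.x,
                             m_World.y + m_LocalBS.centre.y,
                             m_World.z + m_LocalBS.centre.z };
        m_WorldBS.radius = m_LocalBS.radius;
        for (Node* c : m_Children) c->Update();   // depth-first into children
    }
};
```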
Leaf nodes (objects) return transformed bounding spheres. What's wrong with this code?
If m_Dirty == false, then we get a branch misprediction, which costs 23 or 24 cycles.
Calculation of the world bounding sphere takes only 12 cycles.
So using a dirty flag here is actually slower than not using one (in the case where it is false).
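The point can be illustrated with two equivalent accessors (a sketch with invented names and a translation-only transform; the cycle counts above are from the talk's target hardware, not something this sketch measures):

```cpp
// Version A branches on the dirty flag; version B just recomputes. When the
// recompute costs ~12 cycles and a mispredicted branch costs ~23-24, the
// "lazy" version can lose even though it sometimes skips the work.
struct Sphere { float x, y, z, r; };

struct Obj {
    Sphere m_LocalBS {1, 2, 3, 4};
    Sphere m_WorldBS {};
    float  m_Pos[3]  {10, 0, 0};   // translation-only stand-in for the transform
    bool   m_Dirty = true;

    void TransformBS() {           // the ~12 cycles of real work
        m_WorldBS = { m_LocalBS.x + m_Pos[0], m_LocalBS.y + m_Pos[1],
                      m_LocalBS.z + m_Pos[2], m_LocalBS.r };
    }
    const Sphere& GetWorldBS_Lazy() {      // version A: dirty flag
        if (m_Dirty) { TransformBS(); m_Dirty = false; }
        return m_WorldBS;
    }
    const Sphere& GetWorldBS_Eager() {     // version B: always recompute
        TransformBS();
        return m_WorldBS;
    }
};
```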
Cache usage
[Animation: stepping through the traversal, each node access pulls data from main memory into the L2 cache. parentTransform is already in the cache (somewhere). The new code checks the dirty flag, then sets the world bounding sphere.]
The Test
- 11,111 nodes/objects in a tree, 5 levels deep
- Every node being transformed
- Hierarchical culling of the tree
- Render method is empty
Performance
This is the time taken just to traverse the tree!
Why is it so slow? (~22ms)
Look at GetWorldBoundingSphere()
[Profiler close-up: the if(!m_Dirty) comparison]
How can we fix it, and still keep the same functionality and interface?
Allocate contiguously:
- Nodes
- Matrices
- Bounding spheres
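One way to realise this (a sketch with invented names): pool each kind of data in its own contiguous array, and let nodes refer to their transform and sphere by index rather than owning them:

```cpp
#include <vector>

// All matrices, bounding spheres and nodes live in contiguous arrays, so a
// traversal walks memory linearly instead of chasing pointers across the heap.
struct Matrix { float m[16]; };
struct Sphere { float x, y, z, r; };
struct Node   { int transform; int sphere; int parent; };  // indices, not pointers

struct Scene {
    std::vector<Matrix> transforms;   // all transforms, contiguous
    std::vector<Sphere> spheres;      // all bounding spheres, contiguous
    std::vector<Node>   nodes;        // all nodes, contiguous

    int AddNode(int parent) {
        transforms.push_back(Matrix{});
        spheres.push_back(Sphere{});
        nodes.push_back(Node{ (int)transforms.size() - 1,
                              (int)spheres.size() - 1, parent });
        return (int)nodes.size() - 1;
    }
};
```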
Performance
19.6ms -> 12.9ms
35% faster just by moving things around in memory!
What next?
- Process data in order
- Use implicit structure for the hierarchy
- Minimise to-ing and fro-ing between nodes
- Group logic to optimally use what is already in cache
- Remove regularly called virtuals
Hierarchy
Use a set of arrays, one per hierarchy level.
[Diagrams: the tree of nodes flattened level by level; each node annotated only with its child count (e.g. :2, :4). Because children are stored in order, the parent/child structure is implicit in the layout.]
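A sketch of how the implicit layout can be used (names invented): each level is an array, a node stores only its child count, and a running total recovers where each parent's children start in the next level:

```cpp
#include <vector>
#include <cstddef>

// With children stored in parent order, a prefix sum of the child counts
// (the ":2", ":4" annotations) gives each parent's first-child index in the
// next level's array -- no child pointers needed.
struct LevelNode { int childCount; };

std::vector<int> FirstChildIndices(const std::vector<LevelNode>& level) {
    std::vector<int> first(level.size());
    int running = 0;
    for (std::size_t i = 0; i < level.size(); ++i) {
        first[i] = running;              // where this parent's children begin
        running += level[i].childCount;  // skip past them for the next parent
    }
    return first;
}
```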
OO version: update transforms top down and expand world bounding spheres (WBS) bottom up.
Update transform
[Diagrams: the OO traversal steps through the tree node by node.]
Hierarchical bounding spheres pass info up. Transforms cascade down. Data use and code are striped; processing is alternating.
Conversion to linear
[Diagrams: per-level node arrays, annotated with child counts (:4)]
For each node at each level (top down) {
    multiply world transform by parent's
    transform WBS by world transform
}
For each node at each level (bottom up) {
    add WBS to parent's
}
Cull WBS against frustum.
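In C++ the two passes might look like the sketch below (invented names; a single float stands in for both the 4x4 transform and the bounding sphere, so only the per-level traversal pattern is being shown):

```cpp
#include <vector>
#include <cstddef>
#include <algorithm>

// Linear version of the update: one flat array of nodes per level, walked
// top down for transforms and bottom up for bounding spheres.
struct LinNode { float local; float world; float wbs; int parent; };
using Level = std::vector<LinNode>;

// Top down: world transform = parent's world * local; move WBS into world space.
// Level 0's world values are assumed to be set up before the call.
void UpdateTransforms(std::vector<Level>& levels) {
    for (std::size_t l = 1; l < levels.size(); ++l)
        for (LinNode& n : levels[l]) {
            n.world = levels[l - 1][n.parent].world + n.local;
            n.wbs  += n.world;   // stand-in for transforming the sphere
        }
}

// Bottom up: merge each child's world bounding sphere into its parent's
// (std::max stands in for the real sphere union).
void ExpandBoundingSpheres(std::vector<Level>& levels) {
    for (std::size_t l = levels.size(); l-- > 1; )
        for (LinNode& n : levels[l]) {
            float& p = levels[l - 1][n.parent].wbs;
            p = std::max(p, n.wbs);
        }
}
```

Note that each pass is a straight run over contiguous arrays: exactly the access pattern the cache (and a prefetcher) likes.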
Unified L2 Cache
[Animation: parent data and children's data streaming through the unified L2 cache; the children's data is not needed while the parent is being processed.]
Prefetching: because all data is linear, we can predict what memory will be needed in ~400 cycles and prefetch it.
Tuner scans show about 1.7 cache misses per node, but these misses are much more frequent: the pattern becomes code / cache miss / cache miss / code, so there is less stalling.
Performance
19.6ms -> 12.9ms -> 4.8ms
Prefetching
Data accesses are now predictable, so we can use prefetch (dcbt) to warm the cache.
Data streams can be tricky: there are many reasons for stream termination, so it is easier to just use dcbt blindly (look ahead x number of iterations).
Prefetching example
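As a sketch of the idea (dcbt is the PPU's data-cache-block-touch instruction; on GCC/Clang the closest portable analogue is __builtin_prefetch; the 4-iteration look-ahead here is illustrative, not a tuned value):

```cpp
// Blind prefetch while walking a contiguous array of nodes: each iteration
// hints that the node a few steps ahead should be pulled into cache.
// Prefetch is only a hint, so a look-ahead that runs past the end is harmless.
struct PNode { float transform[16]; float sphere[4]; };

float SumRadii(const PNode* nodes, int count) {
    const int kLookAhead = 4;   // tune so data arrives about when it is needed
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        __builtin_prefetch(&nodes[i + kLookAhead]);  // dcbt-style warm-up (GCC/Clang)
        sum += nodes[i].sphere[3];                   // use the current node's radius
    }
    return sum;
}
```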
Performance
19.6ms -> 12.9ms -> 4.8ms -> 3.3ms
A Warning on Prefetching
This example makes very heavy use of the cache, which can affect other threads' use of the cache. Multiple threads with heavy cache use may thrash it.
Up close
~16.6ms
Performance counters
- Branch mispredictions: 2,867 (cf. 47,000)
- L2 cache misses: 16,064 (cf. 36,000)
In Summary
- Just reorganising data locations was a win
- Data + code reorganisation = dramatic improvement
- + prefetching = even more WIN
OO is not necessarily EVIL. Be careful not to design yourself into a corner. Consider data in your design:
- Can you decouple data from objects? Code from objects?
Simplify systems:
- KISS
- Easier to optimise, easier to parallelise
Homogeneity
Remember:
- Better performance
- Better realisation of code optimisations
- Often simpler code
- More parallelisable code
The END
Questions?