Chun Yang
Jing Wang
Dong Tong
Keyi Wang
Abstract
1. Introduction
2.
Figure 1 describes the general workflow of indirect branch handling in DBT systems. When the branch target is obtained, it is an SPC address, and the PC dispatcher is used to convert the SPC to the related code cache address (TPC). A mapping table is usually needed during this address translation, and the dispatcher jumps to the TPC after the translation. Because an indirect branch always has multiple targets, the PC dispatcher is a single-entry/multi-exit routine.
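As a concrete illustration, the table-lookup flavor of this dispatcher can be sketched in C. This is only a minimal sketch: the hash function, table size, and direct-mapped replacement policy here are our own illustrative choices, not those of any particular DBT system.

```c
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 4096  /* hypothetical table size; must be a power of two */

/* One slot of the SPC -> TPC mapping table. */
typedef struct { uintptr_t spc, tpc; } map_entry_t;
static map_entry_t map[MAP_SIZE];

/* Hypothetical hash: index by the low bits of the SPC. */
static size_t map_hash(uintptr_t spc) { return (spc >> 2) & (MAP_SIZE - 1); }

/* Record a translated block: SPC in the source binary, TPC in the code cache. */
void map_insert(uintptr_t spc, uintptr_t tpc) {
    map[map_hash(spc)] = (map_entry_t){ spc, tpc };
}

/* The dispatcher: convert a branch-target SPC to its TPC, or 0 on a miss
 * (a real DBT system would fall back to the translator here). */
uintptr_t dispatch(uintptr_t spc) {
    map_entry_t e = map[map_hash(spc)];
    return (e.spc == spc) ? e.tpc : 0;
}
```

In a real system this lookup is emitted as inline machine code on the indirect-branch path, which is exactly why its instruction count matters so much.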
There are many methods to implement the PC dispatcher. Classified by address translation method, the PC dispatcher can be prediction-based or table-lookup-based. The prediction-based method achieves better performance when the prediction hits, but has to perform a full table lookup when the prediction fails.
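The prediction-based scheme essentially inlines a short compare-and-jump chain before the full lookup. A minimal C analogue follows; the two-entry chain length, the slot-shifting update policy, and the `full_lookup` stub are illustrative assumptions of ours, not details from any specific system.

```c
#include <stdint.h>

/* Per-branch-site prediction slots, holding the most recent targets. */
typedef struct {
    uintptr_t pred_spc[2], pred_tpc[2];  /* two predicted SPC -> TPC pairs */
} ib_site_t;

/* Stub standing in for the slow path (e.g. a full hash-table lookup);
 * here it just pretends TPC = SPC + 0x1000 so the sketch is self-contained. */
static uintptr_t full_lookup(uintptr_t spc) { return spc + 0x1000; }

uintptr_t predict_dispatch(ib_site_t *site, uintptr_t spc) {
    /* Fast path: a couple of inlined compares, cheap when the prediction hits. */
    if (spc == site->pred_spc[0]) return site->pred_tpc[0];
    if (spc == site->pred_spc[1]) return site->pred_tpc[1];
    /* Slow path on a miss: full lookup, then refresh the prediction slots. */
    uintptr_t tpc = full_lookup(spc);
    site->pred_spc[1] = site->pred_spc[0]; site->pred_tpc[1] = site->pred_tpc[0];
    site->pred_spc[0] = spc; site->pred_tpc[0] = tpc;
    return tpc;
}
```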
Classified by code placement, the PC dispatcher can be private or shared. A private dispatcher is always inlined in the code cache to eliminate the overhead of context switches. A shared PC dispatcher is used by more than one indirect branch, which reduces the total code size.
Classified by implementation, the PC dispatcher can be pure-software or SW/HW co-designed. The co-designed mechanisms usually achieve a considerable performance improvement, but suffer from poor versatility.
2.1. Related Work
In this section, we discuss previous indirect branch handling mechanisms of DBT systems. For clarity, we classify these mechanisms by whether special hardware or instructions are needed.
2.1.1. Pure-Software Mechanisms
2.1.2. Co-Designed Mechanisms
3.
[Figure: SPIRE overview — the indirect branch (IB) in a code cache block jumps to its $SPC in the source binary, where a SPIRE entry ("Jump to $TPC") replaces the original source binary instructions and redirects control flow ("2. Redirecting") to the target code block in the code cache.]
For the program to run correctly in all cases, SPIRE should keep the corrupted source binary code fully transparent to other code.
Architecture-Specific Issues. There are also some implementation issues related to the features of the source/host architectures. For example, SPIRE may leverage a page protection scheme to keep correctness, which relies on the support of the memory management unit; the redirecting trampoline should contain a jump instruction, which may be limited by the addressing modes of the ISA.
Next, we will discuss the general solutions to these challenges,
and explore more details of SPIRE implementations on several
typical platforms.
3.5. Optimization
SPIRE adopts a profiling scheme to selectively optimize indirect branches, and an invalidation routine is called when SPIRE may harm performance.
Online Profiling. As noted by many researchers, most of the execution time is spent in a small portion of hot code, and the same holds for indirect branches. Most dynamic executions of indirect branches happen at a few hot branch sites, while some cold indirect branches execute only a few times. As shown in Figure 3, on our benchmarks, about 40% of indirect branch instructions execute less than 1000 times and contribute less than 1% of the total executions.
SPIRE relies on page faults and software exceptions to trigger the table-filling routine, which may cost more than 1000 cycles. Thus using SPIRE to handle cold indirect branches is not worthwhile. Furthermore, more shadow pages consume more extra memory.
To minimize the overhead of SPIRE, we explore an online profiling mechanism to identify hot instructions. When the execution count of an indirect branch exceeds a pre-defined threshold, it is marked as a hot instruction. Only hot instructions are optimized by SPIRE; cold indirect branches are still handled by other mechanisms such as hash lookup.
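The counting scheme above can be sketched in a few lines of C. The threshold value of 1000 matches the cold/hot boundary cited for Figure 3, but the structure and function names are ours, and the paper's actual threshold may differ.

```c
#include <stdint.h>
#include <stdbool.h>

#define HOT_THRESHOLD 1000  /* hypothetical threshold; the paper's value may differ */

typedef struct {
    uintptr_t spc;       /* address of the indirect branch in the source binary */
    uint32_t  exec_count;
    bool      is_hot;    /* once hot, the branch is handed over to SPIRE */
} ib_profile_t;

/* Called on each execution of a not-yet-optimized indirect branch.
 * Returns true exactly once, when the branch first crosses the threshold. */
bool ib_profile_tick(ib_profile_t *p) {
    if (p->is_hot) return false;
    if (++p->exec_count >= HOT_THRESHOLD) {
        p->is_hot = true;   /* promote: from now on SPIRE handles this branch */
        return true;
    }
    return false;           /* still cold: keep using hash lookup */
}
```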
3.4. Transparency
[Figure 3: distribution of indirect branch execution counts on our benchmarks, bucketed into (0,1000], (1000,1E+7] and (1E+8,+]; y-axis 0%–100%.]
The shadow pages are mapped at a fixed offset from the source binary space, so the original code pages will not be corrupted. With this mechanism, the table-entering routine should jump to the address SPC + Offset instead of the original SPC.
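The address arithmetic of this offset scheme is trivial, which is why it stays so much cheaper than a hash lookup. In this C sketch, `SPIRE_OFFSET` and the function name are hypothetical placeholders; the actual offset is chosen by the implementation when the shadow pages are mapped.

```c
#include <stdint.h>

/* Hypothetical fixed distance between a source binary page and its shadow page. */
#define SPIRE_OFFSET 0x10000000UL

/* With shadow pages, the table-entering routine must target SPC + Offset,
 * so the jump lands in the shadow page instead of the original code page. */
static inline uintptr_t spire_entry_addr(uintptr_t spc) {
    return spc + SPIRE_OFFSET;
}
```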
In this case, instead of the original one-instruction table-entering routine, SPIRE needs about five instructions to add the offset to the SPC, which is still much better than a hash lookup. The following code sequence is an example of the SPIRE table-entering routine.
push %ecx;        // spill register
...
pop  %ecx;
je   0x80a980a
0x80a9809:

4. Implementation of SPIRE
Benchmark    Test Suite   Language
gcc          …            …          250    1.37    4.19    0.3%
crafty       …            …          50     4.14    23.1    0.2%
eon          …            C++        100    6.24    11.3    0.5%
perlbmk      …            …          131    12.0    11.9    1.0%
gap          …            …          419    28.3    24.1    1.2%
perlbench    …            …          167    82.3    79.1    1.0%
sjeng        …            …          48     40.1    59.2    0.7%
richards     …            C++        50     6.58    5.52    1.2%

5. Evaluation
Normalized execution time is the ratio of the execution time on the DBT system to the execution time on the native machine, and is widely used to measure the performance of DBT systems. We use overhead reduction to measure the performance improvement of our mechanism, defined as follows:

Overhead = Normalized Execution Time − 1

Overhead Reduction = (Overhead_before − Overhead_opt) / Overhead_before × 100%
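For instance, a system running at 2.5× native time before optimization and 2.0× after has overheads of 1.5 and 1.0, an overhead reduction of (1.5 − 1.0)/1.5 ≈ 33%. These two definitions translate directly into C (the helper names are ours):

```c
/* Overhead = normalized execution time - 1. */
double overhead(double normalized_time) { return normalized_time - 1.0; }

/* Overhead reduction of an optimized run relative to the baseline, in percent. */
double overhead_reduction(double norm_before, double norm_opt) {
    double before = overhead(norm_before);
    double opt    = overhead(norm_opt);
    return (before - opt) / before * 100.0;
}
```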
[Table: experimental environment — Hardware, Operating System, Compiler, DBT system.]
[Figure: normalized execution time of HDTrans-SPIRE, HDTrans-0.4, HDTrans-SP, DynamoRIO-3.1 and Pin-2.7 on gcc, crafty, eon, perlbmk, gap, perlbench, sjeng and richards, with geometric mean.]

[Figure: ratio of chain and length of chain for HDTrans-SPIRE, HDTrans-0.4 and HDTrans-SP.]
Only the shadow pages containing SPIRE table entries occupy physical memory, and the online profiling mechanism further reduces the number of shadow pages. As shown in Figure 11, SPIRE consumes 5.6% more physical memory than before on average, which is acceptable.
6. Conclusion

When an indirect branch executes, it jumps to its SPC, where the SPIRE table entry redirects the control flow to the related TPC. Only 2~6 instructions are needed to handle an indirect branch execution.
We implemented our SPIRE mechanism on the x86 platform. The experiments prove that, compared with previous techniques, SPIRE can significantly improve performance while introducing an acceptable memory overhead. The evaluation of micro-architecture features and the evaluation on different CPUs also illustrate the effectiveness of SPIRE. Furthermore, we discussed the implementation issues of SPIRE on different architectures.
SPIRE doesn't need special hardware or instructions, so it can be widely used in DBT systems. It can also cooperate with other optimization mechanisms. For example, it can be combined with software prediction: with this adaptive mechanism, highly predictable IBs are handled by software prediction, and hard-to-predict IBs are handled by SPIRE. We believe the idea of SPIRE can also be applied in other scenarios that need address translation.
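The adaptive selection described above could be sketched as a per-site policy in C. The 90% hit-rate cut-off, the 1000-sample warm-up, and all names here are illustrative assumptions, not values from the paper:

```c
#include <stdint.h>

/* Hypothetical per-site prediction statistics for the adaptive scheme. */
typedef struct { uint32_t hits, misses; } pred_stats_t;

typedef enum { HANDLER_PREDICTION, HANDLER_SPIRE } handler_t;

/* Pick a handler once enough samples are gathered: highly predictable
 * branches stay on software prediction, hard-to-predict ones move to SPIRE.
 * The 1000-sample warm-up and 90% threshold are illustrative choices. */
handler_t choose_handler(const pred_stats_t *s) {
    uint32_t total = s->hits + s->misses;
    if (total < 1000) return HANDLER_PREDICTION;          /* not enough data yet */
    return (s->hits * 10 >= total * 9) ? HANDLER_PREDICTION  /* >= 90% hit rate */
                                       : HANDLER_SPIRE;
}
```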
[Figure: normalized execution time of HDTrans-SPIRE, HDTrans-0.4 and HDTrans-SP on different CPUs: Opteron-2214, i3-550 and Atom-330.]