1063-640-4 $4.00 © 1994 IEEE
compression applications. Since embedded systems are highly cost sensitive and typically only execute a single program, it is not possible to include temporary storage for an uncompressed version of the program. Instead, the program must be decompressed on demand at run time, so that an uncompressed copy of the next instruction is always available. The system proposed in [Wolfe92] uses the existing instruction cache in high-performance processors as a decompression buffer, storing uncompressed copies of recently used fixed-sized blocks of instructions. These fixed-sized blocks are decompressed by the cache refill hardware whenever there is an instruction cache miss. Since the program must be decompressed in small fixed-sized blocks rather than the more common approach of decompressing the entire program from beginning to end, the most obvious compression methods require that each block has been separately compressed. Furthermore, to retain high performance it must be possible to decompress a block with low latency, preferably no longer than a normal cache line refill.

In addition to the existing work on file-based compression, automated compression and decompression schemes have been implemented in general-purpose systems at slower levels in the memory hierarchy [Tauton91]. Automated file compression systems such as the DoubleSpace utility in MS-DOS 6.2 use file and block based compression to reduce the disk space requirement of files. A similar method is discussed in [Cate91] using compression within memory and disk to compress pages in a demand-paged virtual memory system. These disk-based systems use large data blocks, on the order of 4K-16K bytes, rather than the 16-64 byte blocks common in instruction caches. Furthermore, disk-based systems can tolerate decompression latencies on the order of 10-200ms rather than the 50-500ns latency that is tolerable in embedded system program compression. These differences in scale allow the effective use of Lempel-Ziv type algorithms implemented in either hardware or software for disk or virtual-memory based compression. Unfortunately, this class of algorithms does not appear to be practical or effective for short program blocks at cache speeds.

A program compression scheme based on dictionary compression has been proposed in [Devedas94]. This work presents some interesting ideas in software-only compression; however, the experimental results do not yet validate the methods. The experimental results are based on unoptimized assembly code without the inclusion of libraries. These examples contain far more redundancy than fully optimized applications. In fact, simply enabling the optimizer produces smaller code than the dictionary-based compression.

2.2. Mechanisms for Program Compression

The key challenge in the development of a code compression scheme for existing microprocessor architectures is that the system must run all existing programs correctly. Furthermore, the performance of a compressed code processor should be comparable to that of a traditional processor. The use of instruction cache based decompression assures that these requirements can be met. All instructions are fetched through the instruction cache. Since they are stored uncompressed in cache, they can always be fetched from the original program address. Furthermore, since the vast majority of instruction fetches result in cache hits, the performance of the processor is unchanged for these instructions.

In an embedded system, it is not possible to decompress the entire program at once; therefore, a block oriented compression scheme is required. The experiments we have performed are based on compressing 32-byte cache lines into smaller byte-aligned blocks as shown in Figure 1. A number of compression techniques are possible, but they all must allow for effective run-time decompression. Compression takes place at program development time; therefore compression time is immaterial. Decompression time, however, directly impacts cache refill time and thus performance.

Figure 1 - Block Bounded Compression. [Figure: an 8-word fully-aligned block (addresses ...00, ...04, ...08, ...0C, ...10, ...14, ...18, ...1C) is compressed into an n-byte unaligned block.]

Maintaining full compatibility with existing code presents a problem when executing control transfer instructions such as jumps or procedure calls. The address of the jump target in the compressed code is different than it is in the uncompressed code. This problem is one reason why continuous file-based compression is impractical for direct execution. If a program branches to an instruction at a given address, how can that instruction be found in the compressed program? A specific jump target address in the original code may not even correspond to an addressable byte boundary in the compressed code. While it might be possible to place all jump targets on addressable boundaries and replace uncompressed code target addresses in the original code with the new compressed code target addresses, this introduces new problems. Jump targets that happen to be in cache would have different addresses than the same targets in main memory. Furthermore, programs often contain indirect or computed jump targets. To convert these addresses would require modifications to the address computation algorithms in the compiled code.

In-cache expansion solves most addressing problems. The address of a jump target in cache is the same as in the original uncompressed program. If a program jumps to a target that is not in cache, that target is brought into the cache before execution. This only requires that the processor locate the address of the beginning of each compressed cache line. This restricts each compressed cache line such that it must start on an addressable boundary. Some record of the new location of each cache line is required to map the program address of each block to its actual physical storage location.

A new structure is incorporated into the cache refill hardware. The Line Address Table or LAT maps program instruction block addresses into compressed code instruction block addresses. The data in the LAT is generated by the compression tool and stored along with the program. Figure 2 diagrams the LAT for a 32-byte cache line.
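The block-bounded scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's hardware design: zlib stands in for the actual coder (the paper uses Huffman coding in the refill hardware), and the 32-byte line size matches the experiments. Each line is compressed independently so any single line can be refilled on a cache miss without touching the rest of the program.

```python
import zlib

LINE_SIZE = 32  # bytes per cache line, as in the experiments

def compress_lines(text_segment: bytes):
    """Compress each cache line independently so any one line can be
    decompressed on demand. zlib is only a stand-in coder here; note
    that general-purpose coders carry per-block overhead on blocks
    this small, which is why a preselected code is used in hardware."""
    lines = [text_segment[i:i + LINE_SIZE]
             for i in range(0, len(text_segment), LINE_SIZE)]
    # Each compressed line is stored separately, byte-aligned.
    return [zlib.compress(line) for line in lines]

def refill(compressed, line_index: int) -> bytes:
    # On a cache miss, only the missing line is decompressed.
    return zlib.decompress(compressed[line_index])

program = bytes(range(256)) * 4          # dummy 1 KB "text segment"
blocks = compress_lines(program)
assert refill(blocks, 3) == program[96:128]
```

The point of the sketch is the structure, not the coder: because no line's encoding depends on any other line, decompression latency is bounded by one block, as the text requires.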
Figure 2 - Line Address Table. [Figure: the LAT translates a cache line address into the compressed line's storage address.]

Using a Line Address Table, all compressed code can be accessed normally by the processor without modifying the processor operation or the program. Line Address Table access increases cache line refill time by a marginal amount, at least one memory access time. This is not a major effect since it only occurs during a cache miss; however, this effect can be further reduced by using another small cache to hold the most recently used entries from the LAT. This cache is essentially identical to a TLB and in fact is called the Cache Line Address Lookaside Buffer or CLB. In practice, the LAT is simply stored in the instruction memory. A base register value within the cache refill engine is added to the line address during CLB refill in order to index into this table. Figure 3 shows how the overall instruction memory hierarchy might be implemented in a typical system.

This code, called a Preselected Huffman Code, is then built into the decompression logic rather than stored in memory. The effectiveness of this code is generally independent of block size; however, the embedded system compression mechanisms add additional overhead that limits the effectiveness of coding.

Huffman codes suffer from inherent inefficiency whenever the frequency of occurrence of symbols is not exactly a negative power of two. This is a quantization effect caused by the requirement that an integral number of bits is used to code each symbol. This is further compounded in many cases by the fact that 8-bit symbols have been used rather than 32-bit symbols corresponding to the size of RISC instructions. This is necessary to reduce the complexity of the decoder. Despite these inefficiencies, our experiments show that the effect of these factors is small. The effectiveness of compression is also reduced by the fact that each compressed block must be stored on a byte-addressable boundary. This adds an average of 3.5 bits to every coded block.

Another factor contributing to coding overhead is the storage of the Line Address Table. Storing a full pointer to each compressed line would be prohibitively expensive; however, we have used an ad-hoc compression technique to pack multiple pointers into each LAT entry based on storing a base address plus the length of each compressed line. According to this design, the compressed cache lines are aligned on byte boundaries in the compressed program storage area, and the LAT provides a compact index into these compressed cache lines. Specifically, if the cache line size is l bytes, the address space is b bits, and each LAT entry provides a pointer to each of c cache lines, then each LAT entry occupies:

b + c * ceil(log2(l)) bits

Because each LAT entry locates c*l bytes, the overhead associated with this design is:

(b + c * ceil(log2(l))) / (8 * c * l)
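The LAT entry layout and its overhead are easy to check numerically. The sketch below is illustrative; the function names are mine, and the 24-bit base address is an inference from the 64-bit entry covering 8 blocks of 32 bytes used in the experiments (24 + 8*ceil(log2(32)) = 64).

```python
from math import ceil, log2

def lat_entry_bits(b: int, c: int, l: int) -> int:
    """One LAT entry: a b-bit base address plus a ceil(log2(l))-bit
    length field for each of c compressed lines."""
    return b + c * ceil(log2(l))

def lat_overhead(b: int, c: int, l: int) -> float:
    """LAT storage overhead: entry bits divided by the bits of
    program the entry locates (c lines of l bytes = 8*c*l bits)."""
    return lat_entry_bits(b, c, l) / (8 * c * l)

def line_start_addresses(base: int, lengths: list[int]) -> list[int]:
    """Recover each compressed line's byte address from one entry:
    line k starts at base plus the lengths of lines 0..k-1."""
    addrs, addr = [], base
    for length in lengths:
        addrs.append(addr)
        addr += length
    return addrs

# 64-bit entry per 8 blocks of 32 bytes, assuming a 24-bit address space:
assert lat_entry_bits(24, 8, 32) == 64
assert abs(lat_overhead(24, 8, 32) - 0.03125) < 1e-12  # ~3% of program size
```

Note the design tradeoff the formula exposes: packing more lines per entry (larger c) amortizes the base address, but locating line k then requires summing k length fields during refill.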
investigation into the compressibility of embedded system code is warranted. The purpose of these experiments is to explore the compressibility of code on several modern architectures using traditional coding methods modified for embedded systems program compression.

Embedded system code is rarely portable among architectures. Therefore, the analysis of this paper is based on a set of computer benchmarks which we believe may be a suitable approximation to embedded system code. These benchmarks are taken from the SPEC benchmark suite, the UNIX™ operating system, and some speech encoding/decoding programs. The fifteen example programs are presented in Table I. This set was chosen for its mix of floating point and integer code, various size programs, and portability.

Table I. Benchmark Set

Program    Function
awk        pattern scanning/processing
dnasa7     floating point kernels
doduc      thermohydraulical modelization
eqntott    Boolean equation translation
espresso   Boolean function minimization
fpppp      quantum chemistry
grep       regular expression matching
gsmtx      GSM 06.10 speech CODEC
matrix300  matrix multiplication
neqn       typeset mathematics
sed        stream editor
tomcatv    mesh generation
uvselp     NADC speech coding
xlisp      lisp interpreter
yacc       yet another compiler compiler

Each of the programs in Table I was compiled on six different architectures. This was intended to identify architectural differences (e.g. RISC vs. CISC) which might affect compressibility. The architectures used are shown in Table II. Five of these architectures are typical of current and future 32-bit high-performance embedded processor cores. The VAX is used as a reference point as a high-density instruction set.

Table II. Architectures
VAX 11/750        BSD UNIX 4.3
(SGI) MIPS R4000  IRIX 4.0.5F System V
(Sun) 68020       SunOS 4.1.1
(Sun) SPARC       SunOS 4.1.3
(IBM) RS6000      AIX 3.2

On each architecture, an instruction extraction tool was created. These tools extract the actual instructions (text segment) from executable programs compiled on that architecture. After the program set was compiled on each machine, the instruction extraction tool was applied to each program to isolate the actual program. (This step eliminates the data segments, relocation information, symbol tables, etc.)

Figure 4. Sum of Program Sizes for Each Machine (Normalized to the VAX 11/750) [Figure: normalized program set size for Vax, MIPS, 68020, SPARC, RS6000, and MPC603.]

For comparison, an architectural comparison of the uncompressed text size is presented in Figure 4 where the sum of the sizes in the test set is reported for each machine normalized to the total size of the VAX 11/750. The native programs differ significantly in size based only on the architecture and compiler. This results both from differences in the instruction set encoding and in the speed/size tradeoffs made by the compiler as well as differences in library code. This raises the interesting question of whether the less dense instruction sets contain more redundancy and thus are more compressible. We conducted a number of experiments to investigate this issue.

The first experiment measures the entropy of each program using a byte-symbol alphabet. These entropy measures determine the maximum possible compression under a given set of assumptions. The zeroth-order entropy assumes that the probability of an occurrence of a given symbol a_i of alphabet S is given by p(a_i) and is independent of where the byte occurs in the byte stream. That is, bytes occur in random order, but with a certain distribution. The zeroth-order entropy over a source alphabet, S, is then given by

H_0 = - Σ_i p(a_i) log2 p(a_i)
For our purposes, the alphabet, S, is the set of all possible bytes (8 bits). The probability of a byte occurring was determined by counting the number of occurrences of that byte and dividing by the number of bytes. The encoding tool first builds a histogram from the extracted program. The histogram leads directly to the probability distribution, and the entropy may then be calculated according to the above equation.

If we change the model of our source, we must also determine a new form for the entropy. A more general source model is given if we assume that each byte is dependent upon the previous byte (as in a first-order Markov process). In this case, we have a set of conditional probabilities, p(a_j | a_i), which indicate the probability that symbol a_j occurs given that a_i has just occurred. This model is the first-order model and the entropy is given by the first-order entropy:

H_1 = - Σ_i Σ_j p(a_i, a_j) log2 p(a_j | a_i)

Here, p(a_i, a_j) indicates the probability that the pattern a_i, a_j occurs.

The first-order entropy of each program set was determined similarly to the zeroth-order entropy. The significant difference between the two measurements is that calculating the first-order entropy involves generating the n conditional probabilities for each of the n symbols in the alphabet, S. During processing, an n x n matrix, h, is generated where h(i,j) is the number of a_j symbols which follow a_i symbols. The probability of a symbol occurring may then be given by:

p(a_i) = Σ_j h(i,j) / Σ_i Σ_j h(i,j)

and the conditional probability of a symbol occurring is given by:

p(a_j | a_i) = h(i,j) / Σ_j h(i,j)

Further, the pattern probability may be found from:

p(a_i, a_j) = p(a_i) * p(a_j | a_i)

The average and aggregate calculated values of the entropy for the program set are shown in Figure 5 and Figure 6, respectively. The entropy may be interpreted as the maximum compression ratio where the compression ratio is given by:

Compression Ratio = Compressed Size / Uncompressed Size

The aggregate entropy is measured by generating occurrence statistics on the program set as a whole, and the average entropy is calculated by averaging (arithmetically) the separately measured entropy for each program. For our program set, the aggregate entropy is greater (meaning less compression) than the average entropy, but this is not always true. In the context of a compressed program embedded system, the average entropy expresses the typical compression ratio (in the theoretical limit) for a program if the decompression engine is custom designed for that one program. The aggregate entropy represents the limit of average compression if a single code is used for all programs.

Figure 5. Average Entropy for 6 architectures. [Figure: zero-order and first-order average entropy for each architecture.]

Figure 6. Aggregate Entropy for 6 architectures. [Figure: zero-order and first-order aggregate entropy for Vax, MIPS, 68020, SPARC, RS6000, and MPC603.]

It is clear from the zeroth-order entropy numbers that simple compression methods like Huffman coding are not going to achieve very high rates of compression on this type of data. In fact, it appears that the MIPS instruction set originally studied is the most compressible using zeroth-order methods. This indicates that simple coding methods are likely to be inadequate for other architectures.

In order to measure the efficiency of Huffman coding on these architectures, we used the compression method described in Section 2 to actually compress programs from each architecture in cache-line sized blocks. Figure 7 describes the compression ratio obtained and the sources of compression overhead from 32-byte blocks using a 64-bit LAT entry for each 8 blocks. The coding overhead represents the inefficiency in the Huffman code caused by integral length symbols. This difference between the observed compression
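The two entropy measures are straightforward to reproduce. The sketch below follows the procedure described above (a byte histogram for H_0, a successor-count matrix h(i,j) for H_1) and reports bits per byte-symbol; the function names are mine.

```python
from collections import Counter
from math import log2

def zeroth_order_entropy(data: bytes) -> float:
    """H0 = -sum_i p(a_i) log2 p(a_i), with p estimated from a
    byte histogram of the extracted program."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def first_order_entropy(data: bytes) -> float:
    """H1 = -sum_ij p(a_i,a_j) log2 p(a_j|a_i), where h(i,j) counts
    how often byte a_j immediately follows byte a_i."""
    pairs = Counter(zip(data, data[1:]))   # h(i, j)
    row = Counter(data[:-1])               # sum_j h(i, j)
    total = len(data) - 1                  # sum_ij h(i, j)
    h1 = 0.0
    for (i, _j), c in pairs.items():
        p_pair = c / total                 # p(a_i, a_j)
        p_cond = c / row[i]                # p(a_j | a_i)
        h1 -= p_pair * log2(p_cond)
    return h1

data = bytes(range(256)) * 16   # uniform bytes, fully predictable pairs
assert abs(zeroth_order_entropy(data) - 8.0) < 1e-9
# each byte here deterministically follows its predecessor, so H1 is 0
assert first_order_entropy(data) < 0.01
```

The toy input makes the gap between the two models concrete: zeroth-order sees a uniform (incompressible) byte distribution, while the first-order model captures the deterministic structure, exactly the effect the first-order measurements above are probing for.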
Figure 7. Compression Efficiency [Figure: compression ratio and overhead sources for MPC603, RS6000, MIPS, 68020, SPARC, and Vax.]

[Figure: compression ratio versus symbol size (2-16 bits) for MPC603, RS6000, MIPS, 68020, SPARC, and Vax.]
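The measured scheme, a byte-symbol Huffman code applied per cache line, can be sketched as follows. This is an illustration under stated assumptions, not the paper's decoder: the code table is built once from whole-program statistics (playing the role of the preselected code), each 32-byte line is encoded independently, and the block is padded to a byte boundary, which is the source of the roughly 3.5-bit per-block alignment overhead discussed earlier. The heapq-based construction is a textbook Huffman implementation.

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict[int, str]:
    """Build a byte-symbol Huffman code from whole-program statistics;
    this fixed table stands in for the preselected code."""
    freq = Counter(data)
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)                 # unique tie-breaker for the heap
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (n1 + n2, tick, merged))
        tick += 1
    return heap[0][2]

def encode_line(line: bytes, code: dict[int, str]) -> bytes:
    """Encode one cache line and pad it to a byte boundary; the
    padding is the per-block alignment overhead."""
    bits = "".join(code[b] for b in line)
    bits += "0" * (-len(bits) % 8)   # byte-align the compressed block
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

program = b"add r1,r2,r3; ld r4,0(r1); " * 40   # stand-in text segment
code = huffman_code(program)
packed = encode_line(program[:32], code)
assert len(packed) <= 32             # compressed and byte-aligned
```

Because the code is fixed ahead of time, the encoded length of a line varies with its contents; those per-line lengths are exactly what the LAT entries record.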
varying lengths such that this set of symbols can be combined into strings to create any possible cache line. The encoding problem thus consists of two parts, selection of a set of variable-length symbols for encoding and optimal partitioning of program blocks into those symbols. One possible set of symbols is the set of all 8-bit symbols; hence, the byte-symbol Huffman codes are a subset of variable-length codes.

Although variable-length codes may provide improved compression for embedded programs, the mechanics of generating these codes is computationally complex. Ideally one would like to be able to select the optimal set of symbols such that a program encoded using those symbols is of minimum length. Unfortunately, this problem cannot be solved in reasonable time. Assuming that the maximum number of symbols to be included in the source model is n,

Greedy Symbol Selection Algorithm

One alternative to exhaustive search is to employ a greedy algorithm as in Figure 10. This algorithm successively searches for the next symbol to be included in the set until either the maximum set size is reached or the selection criteria are no longer met. A valid source symbol set must cover the source program. This requires that some sequence of the symbols in the set is equal to the original source program. Preferably, the source symbol set covers all legal programs. The Priority() function used in FindBestSymbol() uses the number of occurrences of a symbol, count, and the symbol size in bits, size, to determine a priority for inclusion of the symbol in the symbol set (the symbol with the highest priority is to be included). This priority is an estimation of the savings that inclusion of the symbol represents. Two versions of the Priority() function are under evaluation. The
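A minimal sketch of such a greedy selector follows. The two Priority() definitions evaluated in the paper are not recoverable from this excerpt, so the sketch substitutes one plausible savings estimate, count * (size - assumed code length), purely as an illustration; the cover requirement is met by always keeping all 256 single-byte symbols in the set.

```python
from collections import Counter

def candidate_symbols(program: bytes, max_len: int = 4) -> Counter:
    """Count every substring of up to max_len bytes; these are the
    variable-length symbol candidates."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(program) - k + 1):
            counts[program[i:i + k]] += 1
    return counts

def priority(count: int, size_bits: int, est_code_bits: int = 8) -> int:
    """Illustrative stand-in for Priority(): occurrences times the bits
    saved if the whole symbol is replaced by one assumed code word."""
    return count * (size_bits - est_code_bits)

def greedy_select(program: bytes, max_set_size: int) -> set:
    counts = candidate_symbols(program)
    # All single bytes stay in the set so it always covers any program.
    chosen = {bytes([b]) for b in range(256)}
    candidates = [s for s in counts if len(s) > 1]
    while len(chosen) < max_set_size and candidates:
        # A fuller version would re-count occurrences after each pick,
        # since selected symbols overlap remaining candidates.
        best = max(candidates, key=lambda s: priority(counts[s], 8 * len(s)))
        if priority(counts[best], 8 * len(best)) <= 0:
            break                    # selection criterion no longer met
        chosen.add(best)
        candidates.remove(best)
    return chosen

symbols = greedy_select(b"ld r1; ld r2; ld r3; " * 30, 300)
assert any(len(s) > 1 for s in symbols)   # multi-byte symbols selected
```

The repetitive toy input shows the intended behavior: frequent multi-byte sequences are promoted into the symbol set, while the single-byte fallback guarantees every cache line can still be partitioned into symbols.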
4. Conclusions and Future Work

This paper presents several experiments concerning the effectiveness of program compression for embedded systems. Simple comparisons of six 32-bit architectures show that significant variations in program size exist between architectures using native program coding. Despite the large variations in program density, compressibility varies only moderately among architectures and appears to be uncorrelated to uncompressed program size. Simple compression methods such as Huffman coding and its variants provide only moderate compression on all of these architectures; however, first-order entropy analysis and Lempel-Ziv based coding demonstrate that better compression is possible.

In order to discover an improved coding method for embedded systems programs, a class of variable source symbol length codes is considered. The complexity of determining optimal variable source symbol length codes is shown to be intractable; however, two greedy heuristics are evaluated. These heuristics provide compression that is no better than common coding methods.

The most obvious area for continued research is the development of more effective heuristics for variable source symbol length codes. In addition, compiler-based methods are under investigation to improve compressibility. Instruction selection, register allocation, and instruction scheduling may all be optimized for low entropy. This may improve practical compression rates. This may be extended into earlier optimization stages of the compiler in order to increase the similarity among program structures and thus increase opportunities for entropy reduction. Concurrently we are investigating opportunities for using the compiler to reduce program size prior to compression. This primarily involves detection of opportunities to combine similar code sequences. Finally, some investigations could be made into designing instruction sets which possess good compressed program characteristics.

5. References

[Cate91] V. Cate and T. Gross, "Combining the Concepts of Compression and Caching for a Two-Level Filesystem," Proc. Fourth International Conf. on Architectural Support for Programming Languages and Operating Systems, ACM, April 1991.

[Devedas94] S. Devedas, S. Liao, and K. Keutzer, "On Code Size Minimization Using Data Compression Techniques," Research Laboratory of Electronics Technical Memorandum 94/18, Massachusetts Institute of Technology, 1994.

[Hamming80] R. W. Hamming, Coding and Information Theory, Prentice-Hall, Englewood Cliffs, NJ, 1980.

[Huffman52] D. A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes," Proceedings of the IRE, Volume 40, pp. 1098-1101, 1952.

[Intrater92] G. Intrater and I. Spillinger, "Performance Evaluation of a Decoded Instruction Cache for Variable Instruction-Length Computers," Proc. of the 19th Symp. on Computer Architecture, IEEE Computer Society, May 1992.

[Kocsis89] A. M. Kocsis, "Fractal Based Image Compression," 1989 Twenty-Third Asilomar Conference on Signals, Systems, and Computers, Volume 1, pp. 177-181.

[Lempel76] A. Lempel and J. Ziv, "On the Complexity of Finite Sequences," IEEE Transactions on Information Theory, Volume 20, pp. 75-81, 1976.

[Petajan92] E. Petajan, "Digital Video Coding Techniques for US High-Definition TV," IEEE Micro, pp. 13-21, October 1992.

[Storer88] J. A. Storer, Data Compression: Methods and Theory, Computer Science Press, Rockville, MD, 1988.

[Tauton91] M. Taunton, "Compressed Executables: an Exercise in Thinking Small," Proceedings of the Summer 1991 Usenix Conference, pp. 385-403.

[Wolfe92] A. Wolfe and A. Chanin, "Executing Compressed Programs on an Embedded RISC Architecture," Proc. Micro-25: The 25th Annual International Symposium on Microarchitecture, 1992.

[Ziv77] J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Volume 23, pp. 337-343, 1977.

[Ziv78] J. Ziv and A. Lempel, "Compression of Individual Sequences via Variable-Rate Coding," IEEE Transactions on Information Theory, Volume 24, pp. 530-536, 1978.