
Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-Box


Stefan Tillich, Martin Feldhofer, Thomas Popp
Institute for Applied Information Processing and Communications
Graz University of Technology, Inffeldgasse 16a, A-8010 Graz, Austria
{stillich,mfeldhof,tpopp}@iaik.tugraz.at
Johann Großschädl
University of Bristol, Department of Computer Science
Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, U.K.
johann.groszschaedl@cs.bris.ac.uk
Abstract
Cryptographic substitution boxes (S-boxes) are an integral part of modern block ciphers
like the Advanced Encryption Standard (AES). There exists a rich literature devoted to the
efficient implementation of cryptographic S-boxes, wherein hardware designs for FPGAs and
standard cells received particular attention. In this paper we present a comprehensive study of
different standard-cell implementations of the AES S-box with respect to timing (i.e. critical
path), silicon area, power consumption, and combinations of these cost metrics. We examine
implementations which exploit the mathematical properties of the AES S-box, constructions
based on hardware look-up tables, and dedicated low-power solutions. Our results show that
the timing, area, and power properties of the different S-box realizations can vary by up
to almost an order of magnitude. In terms of area and area-delay product, the best choices
are implementations which calculate the S-box output. On the other hand, the hardware
look-up solutions are characterized by the shortest critical path. The dedicated low-power
implementations do not only reduce power consumption by a large degree, but they also show
good timing properties and offer the best power-delay and power-area product, respectively.
1 Introduction
The Internet of the 21st century will consist of billions of non-traditional computing systems like
cell phones, PDAs, sensor nodes, and other mobile devices (gadgets) with wireless networking
capability. Wireless networking, along with the fact that many of these devices (e.g. sensor nodes)
are easily accessible, has raised a number of security concerns. Sophisticated security protocols, in
combination with well-established cryptographic primitives, can ensure privacy and integrity of
communication over insecure networks. Consequently, there is an increasing demand to implement
cryptographic algorithms on resource-limited embedded devices like cell phones, PDAs, or sensor
nodes. Even some extremely constrained systems like Radio Frequency Identification (RFID) tags
are required to perform cryptographic operations.
Journal of Signal Processing Systems, vol. 50, no. 2, pp. 251-261, Feb. 2008. Springer-Verlag, 2008.
A preliminary version of this paper was published in the proceedings of SAMOS 2006, LNCS 4017, pp. 457-466.
The Advanced Encryption Standard (AES), which was announced by the NIST in 2001, defines
one of the most important symmetric ciphers for the long-term future [15]. The AES algorithm
is a variant of the Rijndael cipher [4] and can be implemented efficiently in both hardware
and software. Common AES hardware implementations take the form of stand-alone ASICs and
cryptographic coprocessors for system-on-chip (SoC) integration. In addition, hardware/software
co-design techniques like extending the instruction set of a general-purpose processor have been
investigated in the recent past [19]. Due to the high performance of modern microprocessors, AES
software can reach throughput rates that are sufficient for most applications. Therefore, hardware
implementations of the AES algorithm are mainly important for high-end server systems with
extreme performance requirements and for embedded devices with a demand for low power
consumption and small silicon area.
Most of the published AES hardware designs focus on high speed and high throughput for
implementation in FPGAs [3, 11, 16]. In addition, some ASIC implementations have been reported
in the literature. For example, Hodjat et al. developed a 3.84 Gbits/s AES crypto coprocessor with
modes-of-operation support based on a 0.18 µm CMOS technology [7]. Their design features a
128-bit datapath and encrypts a block of data in 11 clock cycles. A completely different design
approach is necessary when optimizing AES hardware for low power consumption or small silicon
area. Feldhofer et al. introduced an AES implementation suited for passively-powered devices like
RFID tags [6]. It comprises an 8-bit datapath which occupies an area of 3,595 gates (including
registers and control logic) when synthesized using a 0.35 µm standard cell library. These results
show that the AES algorithm allows for a wide range of trade-offs between performance, power
consumption, and hardware cost [5].
Symmetric ciphers like the AES require non-linear functions in order to resist linear crypt-
analysis. Substitution is a common function for introducing non-linearity. A substitution function,
generally referred to as S-box, can be realized in form of an arbitrary mapping from input bits
to output bits (e.g. DES [14]) or via algebraic operations (e.g. AES). Different cipher algorithms
use different numbers of S-boxes. For example, DES uses eight S-boxes which map six to four
bits, while AES employs a single S-box which is a bijective mapping from eight to eight bits. The
AES algorithm makes use of its S-box in the SubBytes round transformation and the key expansion
[4]. From a mathematical point of view, the AES S-box is defined as an inversion in the finite field
$\mathbb{F}_{2^8}$ with a specific irreducible polynomial [9], followed by an affine transformation. The inverse
S-box, which is required for the InvSubBytes round transformation for decryption, is simply the
inverse of the affine transformation, followed by an inversion in $\mathbb{F}_{2^8}$.
The S-box is a costly and performance-critical building block of the AES algorithm. Results
from previous work [7, 21] show that the S-box lies on the critical path of many AES architectures
and, hence, limits the maximum clock frequency. In addition, the S-box also impacts area and power
consumption of AES hardware [17, 13]. Therefore, the AES S-box has been a subject of intensive
research in recent years, which has led to a rich literature on efficient S-box design and implementation.
The proposed designs can be roughly categorized into S-boxes that contain optimized
circuits for arithmetic in $\mathbb{F}_{2^8}$ [2, 17, 20], constructions using hardware look-up tables [8, 11], and
dedicated low-power solutions [1], all of which have their specific advantages and disadvantages
with respect to area, delay, and power consumption. Although most papers introducing new S-box
designs provide implementation results and discuss related work, it is generally difficult to compare
the different design approaches since, for example, the implementations may have been produced
using different design flows and tools, different standard cell libraries, or different optimizations
(speed, area) for the synthesis process.
In this paper we analyze and compare silicon area, critical path delay, and power consumption
characteristics of the most common standard-cell designs of the AES S-box in a uniform and
coherent way. We consider in our study designs which exploit the mathematical properties of the
S-box, constructions based on hardware look-up tables, and dedicated low-power solutions. In
contrast to our previous work [18] where we used a 0.35 µm standard-cell library to evaluate
different S-box designs, we conducted the present study on the basis of a more modern 0.25 µm process
technology in order to provide practical insights and results that are closer to the state-of-the-art in
VLSI manufacturing. We put similar effort into optimizing each of the evaluated S-box designs to
ensure a fair comparison. Our results show that the area, delay, and power figures of the different
S-box designs vary significantly (up to almost an order of magnitude), which underpins the importance
of selecting the best-suited S-box with respect to the requirements of the application.
The remainder of this paper is organized as follows. Section 2 briefly explains the AES
algorithm and discusses hardware implementation aspects. In Section 3 we give an overview of different
implementation strategies for the AES S-box. The particular S-box implementations that we used
for our evaluation of area, delay, and power consumption are described in Section 4. Section 5
provides background information on the design flow and evaluation methodology. In Section 6 we
discuss our experimental results, and we finally conclude in Section 7.
2 The Advanced Encryption Standard
In November 2001, after several years of public evaluation, the National Institute of Standards and
Technology (NIST) officially announced the algorithm for the new Federal Information Processing
Standard FIPS-197 [15], also called Advanced Encryption Standard (AES). The block cipher
Rijndael [4] was chosen from 15 submitted candidates and has thenceforward become the AES
algorithm.
The AES is a very flexible algorithm suitable for implementation on many platforms in software
as well as in hardware. Its simplicity and symmetry properties facilitate optimization towards
different objectives such as high performance or low cost. The AES algorithm has a fixed block
size of 128 bits. Each block is organized as a 4 × 4 matrix of bytes, referred to as State. The
FIPS-197 standard defines three different key lengths: 128, 192, and 256 bits. Similar to most
symmetric ciphers, the AES algorithm encrypts an input block by applying a round transformation
several times. Depending on the key length, the number of rounds is either 10, 12, or 14. The
round transformation modifies the 128-bit State from its initial value (i.e. the plaintext) to obtain
the ciphertext after the last round. Each round consists of non-linear, linear, and key-dependent
transformations, which can all be described by means of algebraic operations over the finite field
$\mathbb{F}_{2^8}$. These operations, called SubBytes, ShiftRows, MixColumns, and AddRoundKey, scramble
the bytes of the State either individually, row-wise, or column-wise. Before the first round an initial
AddRoundKey is performed, while in the last round the MixColumns operation is omitted.
The SubBytes transformation substitutes each byte of the State independently. This byte substitution
is defined by the so-called S-box, which can be expressed through arithmetic operations
in the finite fields $\mathbb{F}_2$ and $\mathbb{F}_{2^8}$. More specifically, it is composed of an inversion in $\mathbb{F}_{2^8}$ followed
by an affine transformation. The affine transformation consists of a multiplication with a constant
polynomial over $\mathbb{F}_2$ and addition of another constant polynomial. The SubBytes transformation is
the only non-linear function of the AES algorithm. Its implementation has a major impact on the
area, performance, and power consumption of an AES hardware module.
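The composition of inversion and affine transformation can be sketched in a few lines of Python. This is a minimal illustration and not one of the hardware designs evaluated in this paper. The reduction polynomial t^8 + t^4 + t^3 + t + 1 and the affine constant 0x63 are taken from FIPS-197 [15]; the helper names (gf_mul, gf_inv, affine, sbox) are chosen here for illustration only.

```python
# Sketch: AES S-box computed from its algebraic definition.
# Assumption: field polynomial and affine constant as specified in FIPS-197.

AES_POLY = 0x11B  # t^8 + t^4 + t^3 + t + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of F_{2^8} (polynomial basis, reduced by AES_POLY)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Inverse computed as a^254; the element 0 is mapped to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bit positions."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x: int) -> int:
    """Affine transformation over F_2 as used by SubBytes."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def sbox(x: int) -> int:
    """Forward S-box: inversion in F_{2^8} followed by the affine transformation."""
    return affine(gf_inv(x))
```

Evaluating sbox for all 256 inputs yields the table that the look-up implementations discussed later store directly.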
ShiftRows rotates each row of the State to the left using a specific offset. The offset equals the
row index (starting at 0), which means that the first row is not rotated at all and the last row is
rotated by three bytes to the left.
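As a small illustration of the rotation rule, the following sketch models the State as a list of four rows with four bytes each. This row-wise representation and the function name shift_rows are assumptions made for the example.

```python
# Sketch: ShiftRows on a State stored as four rows of four bytes (assumed layout).

def shift_rows(state):
    """Rotate row r of the 4 x 4 State to the left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


example_state = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
shifted = shift_rows(example_state)
```

In hardware this transformation costs no logic at all, since it amounts to a fixed rewiring of byte positions.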
MixColumns operates on columns of the State. Each column is interpreted as a polynomial
of degree 3 with coefficients from the field $\mathbb{F}_{2^8}$. This polynomial is multiplied by a polynomial
with fixed coefficients, and the result is reduced modulo $g(t) = \{1\}t^4 + \{1\}$ (where $\{1\} \in \mathbb{F}_{2^8}$). The
MixColumns operation is often expressed as a multiplication by a constant 4 × 4 matrix of $\mathbb{F}_{2^8}$
elements with the input column (interpreted as four elements of $\mathbb{F}_{2^8}$), yielding the respective output
column.
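The matrix form of MixColumns can be sketched as follows. The matrix coefficients ({02}, {03}, {01}, {01} in the first row, rotated for the other rows) are not stated in the text above; they are taken from FIPS-197 [15]. The function names are illustrative.

```python
# Sketch: MixColumns applied to a single column of four bytes.
# Assumption: circulant matrix with first row (02, 03, 01, 01) as in FIPS-197.

def xtime(a: int) -> int:
    """Multiply by {02} in F_{2^8}, reducing by t^8 + t^4 + t^3 + t + 1."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def mul3(a: int) -> int:
    """Multiply by {03} = {02} + {01}."""
    return xtime(a) ^ a


def mix_column(col):
    """Multiply the constant 4 x 4 matrix with one input column."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

Because only multiplications by {02} and {03} occur, the whole operation maps to a small network of XOR gates in hardware.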
The three aforementioned transformations form the substitution permutation network of the
AES algorithm, wherein SubBytes represents the substitution part (to increase confusion) and
ShiftRows and MixColumns constitute the permutation part (increasing diffusion). AddRoundKey
simply combines the State with a round key by applying an XOR-operation over all 128 bits.
The KeySchedule transformation produces the 128-bit round keys, whereby the first round key
is equal to the cipher key. All other round keys are computed from the previous round key by using
the S-box functionality and some constants referred to as Rcon. The decryption function recovers
the plaintext from a given ciphertext by executing the inverse round transformations (InvSubBytes,
InvShiftRows, InvMixColumns, and AddRoundKey) in reverse order. All round keys are also used
in reverse order.
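To make the role of the S-box in the key schedule concrete, the following sketch expands a 128-bit cipher key into eleven round keys. It covers only the 128-bit key length and follows the word-oriented description in FIPS-197 [15]; the function names and the on-the-fly S-box computation are choices made for this example.

```python
# Sketch: AES-128 key expansion (FIPS-197), with the S-box computed algebraically.

def xtime(a: int) -> int:
    """Multiply by {02} in F_{2^8}."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gf_mul(a: int, b: int) -> int:
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def sbox(x: int) -> int:
    inv = 1
    for _ in range(254):          # x^254 is the inverse (0 maps to 0)
        inv = gf_mul(inv, x)
    out = 0x63
    for shift in range(5):        # XOR of inv rotated left by 0..4 bits
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


SBOX = [sbox(x) for x in range(256)]


def expand_key_128(key: bytes) -> list:
    """Return the 11 round keys (16 bytes each) for a 128-bit cipher key."""
    assert len(key) == 16
    words = [key[4 * i:4 * i + 4] for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = words[i - 1]
        if i % 4 == 0:
            rotated = temp[1:] + temp[:1]                # rotate word by one byte
            subbed = bytes(SBOX[b] for b in rotated)     # S-box on each byte
            temp = bytes([subbed[0] ^ rcon]) + subbed[1:]
            rcon = xtime(rcon)                           # next Rcon constant
        words.append(bytes(x ^ y for x, y in zip(words[i - 4], temp)))
    return [b"".join(words[4 * r:4 * r + 4]) for r in range(11)]
```

Only every fourth word passes through the S-box, which is why a compact architecture can share its S-box between SubBytes and the key expansion.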
2.1 Hardware Implementation Aspects
The AES is a flexible algorithm well suited for implementation in hardware. A multitude of hardware
architectures are possible, which allows for optimization toward different requirements,
ranging from high performance to low power consumption and small silicon area. A considerable
literature exists that is devoted to efficient hardware implementation of the AES [3, 6, 7, 11, 16, 17,
21]. Depending on the target application, AES architectures can have a datapath width of between
8 and 128 bits. Additionally, it is possible to unroll several rounds and insert pipeline stages into
the design. However, to support different modes of operation like the CBC mode [4], often only
one round is realized in hardware and used repeatedly.
The width of the datapath determines the main characteristics (i.e. performance, area, power
consumption) of an AES implementation. Since the AES is byte-oriented, an 8-bit architecture with
a single S-box is the natural choice for applications where small area and low power dissipation are
crucial, e.g. smart cards or RFID tags. At the other end of the spectrum are 128-bit architectures
containing 16 S-boxes to compute the SubBytes function of a 128-bit data block in one pass. Due
to this massive parallelism, 128-bit architectures can reach high throughput rates at the expense
of large silicon area. 32-bit architectures with four S-boxes constitute a good compromise between
the two aforementioned extremes; they allow for much higher performance than 8-bit architectures
but demand only a fraction of the area of 128-bit implementations.
3 Implementation Strategies for the AES S-Box
All AES architectures sketched in the previous section have a common feature in that the SubBytes
transformation occupies a significant portion of the overall silicon area. The size of SubBytes
is, in turn, determined by the number of S-boxes and their concrete implementation. Various
implementation options for the AES S-box have been investigated in the recent past, which has led
to an abundant literature [1, 2, 8, 10, 12, 13, 17, 20].
The SubBytes transformation substitutes all 16 bytes of the State independently using the
S-box. Furthermore, the S-box is also used in the AES key expansion. In software, the S-box
is typically realized in the form of a look-up table since inversion in the finite field $\mathbb{F}_{2^8}$ cannot
be calculated efficiently on general-purpose processors. In hardware, on the other hand, the
implementation of the S-box is directed by the desired trade-off between area, delay, and power
consumption. The most obvious implementation approach for the S-box takes the form of hardware
look-up tables [11]. However, since encryption and decryption require different tables, and each
table contains 2048 bits, the overall hardware cost of this approach is relatively high.
An implementation option related to standard cells is the use of ROM compilers to produce
hardware macros. For the technology that we used, a sufficiently large ROM would require a
considerable amount of silicon area. The critical path delay would be similar to a hardware look-up
approach, but the power consumption of generated ROMs is about two to three orders of magnitude
higher (see footnote 1). Therefore, we do not consider the implementation of the S-box as ROM in this paper.
More sophisticated approaches calculate the S-box function in hardware using its algebraic
properties [4]. The focus of such implementations is the efficient realization of the inversion in
$\mathbb{F}_{2^8}$, which can be achieved by decomposing the finite field into the sub-fields $\mathbb{F}_{2^4}$ and $\mathbb{F}_{2^2}$. An
inversion in a finite field of characteristic 2 can be carried out in different ways, depending on
the basis which is used to represent the field elements [9]. The two most common types of bases
for $\mathbb{F}_{2^m}$ are the polynomial basis and the normal basis. A polynomial basis is a basis of the
form $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ where $\alpha$ is a root of an irreducible polynomial $p(t)$ of degree $m$ with
coefficients from $\mathbb{F}_2$. On the other hand, a normal basis can be found by selecting a field element
$\beta \in \mathbb{F}_{2^m}$ such that the elements of the set $\{\beta, \beta^2, \beta^4, \ldots, \beta^{2^{m-1}}\}$ are linearly independent.
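The two kinds of bases can be explored with a short script. The sketch below uses the small field with m = 4 and p(t) = t^4 + t + 1 purely as an example; this field, the bit-vector encoding of elements, and all function names are assumptions made for illustration and are not taken from the designs discussed in this paper.

```python
# Sketch: polynomial basis vs. normal basis in a small binary field.
# Assumption: m = 4 with the irreducible polynomial p(t) = t^4 + t + 1.

M = 4
POLY = 0b10011  # t^4 + t + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements given as bit vectors in the polynomial basis."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & (1 << M):
            a ^= POLY
        b >>= 1
    return result


def gf_pow(a: int, e: int) -> int:
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def linearly_independent(elements) -> bool:
    """Gaussian elimination over F_2 on the bit-vector representation."""
    pivots = {}  # highest set bit -> reduced vector
    for v in elements:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
        if v == 0:
            return False
    return True


def polynomial_basis(alpha: int):
    return [gf_pow(alpha, i) for i in range(M)]


def conjugates(beta: int):
    """The set {beta, beta^2, beta^4, ..., beta^(2^(m-1))}."""
    return [gf_pow(beta, 2 ** i) for i in range(M)]


ALPHA = 0b0010  # the residue class of t, a root of p(t)
normal_elements = [b for b in range(1, 1 << M) if linearly_independent(conjugates(b))]
```

In this example field, the root of p(t) generates a polynomial basis but not a normal basis, so the choice of the generating element matters when a normal basis is required.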
A third approach for implementing the AES S-box was proposed by Bertoni et al. in [1].
By using an intermediate one-hot encoding of the input, arbitrary logic functions (including
cryptographic S-boxes) can be realized with minimal power consumption. The main drawback of
this approach is that it results in relatively large silicon area.
4 Implementation Details
All AES S-box implementations analyzed in this paper can perform forward and inverse byte
substitution for encryption and decryption, respectively. We implemented the S-boxes either from
scratch or obtained the HDL descriptions from the authors of the respective publications. The
implementations examined consist solely of combinatorial logic, i.e. no pipeline stages have been
inserted since pipelining does not make sense when a feedback mode of operation like OFB or
CBC is used [7]. In the following we describe a total of eight different implementations of the
AES S-box which can be grouped into three basic categories: look-up implementations, calculating
implementations, and low-power implementations. Four of the eight S-box implementations are
illustrated in Figure 1.
The simplest design in our comparison is a straight-forward implementation of a hardware
look-up table [11]. The synthesizer transforms the behavioral description of the look-up table into
a mass of unstructured standard cells. This approach will be denoted as hw-lut. A modification of
hw-lut is to use sub-tables in order to minimize switching activity in the look-up tables to reduce
power consumption. We examined such solutions with sub-tables of size 16, 32, 64, 128, and 256
bytes, but in this paper we only specify results for size 16 (sub16-lut).
Footnote 1: Unfortunately, the exact performance figures for ROMs were not accessible for the technology we used.
[Block diagrams of four designs. hw-lut: combinational logic with inputs Sin and enc/dec and output Sout. sub16-lut: 16x8-bit LUTs addressed by Sin[3..0], a 32-to-1 multiplexer, and the signals Sin[7..4] and enc/dec. bertoni: decoder and permutation between Sin and Sout. hybrid-lut: inverse affine transformation, GF(2^8) inversion, and affine transformation with multiplexers controlled by enc/dec.]
Figure 1: Comparison of four S-box implementations
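A behavioral model of the sub-table idea is sketched below. It assumes that the 512 table bytes (forward and inverse S-box) are split into 32 sub-tables of 16 bytes each, that the lower input nibble addresses the entry within a sub-table, and that the upper nibble together with the enc/dec bit selects the sub-table. This partitioning is inferred from the labels in Figure 1; the actual circuit structure of the evaluated design is not modeled.

```python
# Sketch: behavioral model of an S-box built from 16-byte sub-tables.
# Assumption: 32 sub-tables, selected by (enc/dec, upper nibble of the input).

def _build_sbox():
    def gf_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11B
            b >>= 1
        return r

    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        out = 0x63
        for s in range(5):
            out ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        table.append(out)
    return table


SBOX = _build_sbox()
INV_SBOX = [0] * 256
for _x, _y in enumerate(SBOX):
    INV_SBOX[_y] = _x

_FULL = SBOX + INV_SBOX                                  # 512 bytes in total
SUBTABLES = [_FULL[16 * i:16 * i + 16] for i in range(32)]


def sub16_lut(s_in: int, decrypt: bool) -> int:
    """Look up one byte: the select lines pick a sub-table, the low nibble an entry."""
    select = (int(decrypt) << 4) | (s_in >> 4)   # drives the 32-to-1 multiplexer
    address = s_in & 0x0F                        # shared by all sub-tables
    return SUBTABLES[select][address]
```

In this model a change that affects only the upper nibble or the enc/dec bit leaves the sub-table address untouched, which hints at how such a partitioning can limit switching activity inside the tables.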
Implementations which calculate the S-box transformation in hardware were first proposed by
Wolkerstorfer et al. [20] and Satoh et al. [17]. The former approach decomposes the elements
of $\mathbb{F}_{2^8}$ into polynomials over the sub-field $\mathbb{F}_{2^4}$ and performs inversion there. Our implementation of
this solution is denoted as wolkerstorfer. Satoh's solution decomposes the field elements further
into polynomials over the sub-field $\mathbb{F}_{2^2}$, where inversion is a trivial swap of the lower and higher
bit of the representation. This implementation is referred to as satoh in the following. Both of
these approaches represent the field elements by using a polynomial basis. Canright improved the
calculation of the S-box by switching the representation to a normal basis [2]. Like in Satoh's
solution, the elements of $\mathbb{F}_{2^8}$ are mapped to a polynomial over the sub-field $\mathbb{F}_{2^2}$. This approach will
be denoted as canright.
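The principle of inverting in a sub-field can be illustrated with a generic tower-field sketch. It represents an 8-bit element as a_h x + a_l with coefficients in $\mathbb{F}_{2^4}$ and reduces by x^2 + x + e. The sub-field polynomial t^4 + t + 1, the search for a suitable constant e, and all names are assumptions of this example. The field polynomials and basis-change matrices used in [20], [17], and [2] differ, so this is not a model of those circuits.

```python
# Sketch: inversion in a quadratic extension of F_{2^4} using one sub-field inversion.
# Assumptions: F_{2^4} defined by t^4 + t + 1; extension defined by x^2 + x + E.

def mul16(a: int, b: int) -> int:
    """Multiplication in F_{2^4}."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x10:
            a ^= 0b10011
        b >>= 1
    return r


def inv16(a: int) -> int:
    """Inversion in F_{2^4} as a^14 (0 maps to 0)."""
    r = 1
    for _ in range(14):
        r = mul16(r, a)
    return r


# Pick the first E for which x^2 + x + E has no root in F_{2^4} (hence is irreducible).
E = next(e for e in range(1, 16)
         if all(mul16(x, x) ^ x ^ e != 0 for x in range(16)))


def mul256(a: int, b: int) -> int:
    """Multiply (ah x + al)(bh x + bl) and reduce with x^2 = x + E."""
    ah, al, bh, bl = a >> 4, a & 0xF, b >> 4, b & 0xF
    hh = mul16(ah, bh)
    high = hh ^ mul16(ah, bl) ^ mul16(al, bh)
    low = mul16(hh, E) ^ mul16(al, bl)
    return (high << 4) | low


def inv256(a: int) -> int:
    """Invert ah x + al using a single inversion in the sub-field."""
    ah, al = a >> 4, a & 0xF
    d = mul16(mul16(ah, ah), E) ^ mul16(ah, al) ^ mul16(al, al)
    d_inv = inv16(d)
    return (mul16(ah, d_inv) << 4) | mul16(ah ^ al, d_inv)
```

The 8-bit inversion is thereby reduced to a handful of 4-bit multiplications and one 4-bit inversion, which is the source of the small area of the calculating implementations.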
A compromise between hardware look-up and calculation has also been examined. In this implementation
(denoted as hybrid-lut) only the inversion in $\mathbb{F}_{2^8}$ is realized as a look-up table. Since
the inversion is used for both encryption and decryption, the size of the look-up table is halved in
relation to the hw-lut approach. The affine and inverse affine transformations are performed via
logic circuits just as in the calculating implementations of wolkerstorfer, satoh, and canright.
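The hybrid structure can be modeled behaviorally as follows: a single 256-entry inversion table is shared by both directions, while the affine and inverse affine transformations are computed. The affine constants follow FIPS-197 [15]; the function names are illustrative and the code says nothing about how the synthesized circuit is organized.

```python
# Sketch: behavioral model of the hybrid approach (shared inversion table plus logic).

def _gf_mul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def _gf_inv(a: int) -> int:
    r = 1
    for _ in range(254):
        r = _gf_mul(r, a)
    return r


INV_TABLE = [_gf_inv(x) for x in range(256)]  # one 256 x 8-bit table for both directions


def _rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def affine(x: int) -> int:
    return x ^ _rotl8(x, 1) ^ _rotl8(x, 2) ^ _rotl8(x, 3) ^ _rotl8(x, 4) ^ 0x63


def inv_affine(x: int) -> int:
    return _rotl8(x, 1) ^ _rotl8(x, 3) ^ _rotl8(x, 6) ^ 0x05


def hybrid_lut(s_in: int, decrypt: bool) -> int:
    """Encryption: table then affine. Decryption: inverse affine then table."""
    if decrypt:
        return INV_TABLE[inv_affine(s_in)]
    return affine(INV_TABLE[s_in])
```

The table holds 256 x 8 = 2048 bits, compared with 4096 bits for separate forward and inverse tables.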
The low-power approach of Bertoni et al. [1] uses a decode stage to convert the eight bits of the
input byte and the control bit which selects encryption or decryption into a one-hot representation
consisting of $2^9 = 512$ bits. The substitution itself is just a rearrangement of these bits and can be
done efficiently in hardware by a rewiring of lines as illustrated in Figure 1. Since two of the lines
always map to the same 8-bit result (one for encryption and one for decryption), these line pairs can
be combined with a logical OR to yield a one-hot decoded representation of the result consisting
of 256 bits. A subsequent encoder stage transforms this result back to an 8-bit binary value. Due
to this decoder-permute-encoder structure, there is only very little signal activity within the circuit
when the input changes, resulting in low power consumption. Note that the structure of Bertoni's
approach makes it easy to introduce pipeline stages. However, it may be necessary to add
a large number of additional flip-flops when the pipeline stage is placed between the decoder and
encoder, i.e. on the one-hot encoded signal lines. These flip-flops will increase power consumption
considerably and can easily negate the low-power advantages of this solution. For design scenarios
where both power consumption and silicon area are of minor importance, Bertoni's approach can
offer the best opportunity for reaching very high clock frequencies.
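The decoder-permute-encoder structure can be mimicked in software to see why so few lines toggle. The sketch below models the three stages with Python lists; the list-based encoding and all names are assumptions of this example, and the multi-stage decoder of the original design is not modeled.

```python
# Sketch: behavioral model of a decode / permute / encode S-box.
# Assumption: line index = (enc/dec bit << 8) | input byte.

def _build_sbox():
    def gf_mul(a, b):
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= 0x11B
            b >>= 1
        return r

    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        out = 0x63
        for s in range(5):
            out ^= ((inv << s) | (inv >> (8 - s))) & 0xFF
        table.append(out)
    return table


SBOX = _build_sbox()
INV_SBOX = [0] * 256
for _x, _y in enumerate(SBOX):
    INV_SBOX[_y] = _x

# Wiring: one-hot input line i is connected to result line WIRING[i].
WIRING = SBOX + INV_SBOX


def decode(s_in: int, decrypt: bool):
    """9-bit input (byte plus enc/dec) to a one-hot vector of 512 lines."""
    lines = [0] * 512
    lines[(int(decrypt) << 8) | s_in] = 1
    return lines


def permute_and_merge(lines):
    """Rewire the 512 lines; each of the 256 result lines is the OR of two inputs."""
    result = [0] * 256
    for i, bit in enumerate(lines):
        result[WIRING[i]] |= bit
    return result


def encode(one_hot) -> int:
    """One-hot vector of 256 lines back to an 8-bit binary value."""
    return one_hot.index(1)


def bertoni_sbox(s_in: int, decrypt: bool) -> int:
    return encode(permute_and_merge(decode(s_in, decrypt)))
```

Whenever the input changes, exactly two of the 512 decoded lines change their value in this model (one falls, one rises), irrespective of how many input bits toggled.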
We tested two implementations of Bertoni's approach: One implementation uses a decoder with
four stages as proposed in the original publication for minimal power consumption (bertoni). The
second implementation, denoted as bertoni-2stg, uses a different decoder structure with only two
stages in order to reduce the critical path of the circuit.
In the remainder of this paper we will refer to wolkerstorfer, satoh, and canright as calculating
implementations. We will denote hw-lut and hybrid-lut as look-up implementations, and sub16-
lut, bertoni, and bertoni-2stg as low-power implementations.
5 Design Flow and Evaluation Methodology
In contrast to our previous work [18] where we used a 0.35 µm standard cell library from Austriamicrosystems,
all results in this paper were obtained with the VST250 standard cells from Virtual
Silicon. These standard cells are built upon the 0.25 µm process technology L250 of UMC, which
provides one poly-silicon layer and five metal layers. The nominal supply voltage of the VST250
cell library is 2.5 V.
We implemented the eight S-box designs described in Section 4 in VHDL according to the
specifications in the respective papers. In order to ensure a fair comparison and a common interface
for all implementations, we provided the input and output of each S-box with 8-bit registers. The
integration of the registers made it possible to optimize for area and delay during synthesis. The
logic synthesis was done using the Physically Knowledgeable Synthesis (PKS) tool from Cadence.
We varied the constraints for the delay time (i.e. maximum clock frequency) from the minimum
value to a value where the constraints could just be met. The delays given in Table 1 are the actual
delays of the synthesized circuit. Empty cells in the table indicate that the respective target delay
could not be achieved by the synthesizer.
After synthesis, the placement and routing of the standard cells was performed with the Cadence
tool First Encounter. We did not include I/O cells into the designs, i.e. we analyzed only the core of
the S-boxes consisting of standard cells and the power supply rings. During placement we used an
area utilization of 70%. All the figures in Table 1 are results from synthesis excluding the clock tree
for the input and output registers. After the routing step we integrated the layouts of the standard
cells into the design, which gave us the full layout in GDS2 format.
Design         Result           Target delay (ns)
                                   2.00   3.00   4.00   5.00   6.00   7.00   8.00   9.00
canright       Act. delay (ns)        -      -      -   4.98   5.00   6.55   6.55   6.55
               Area (GE)              -      -      -    496    400    303    303    303
               Power (A)              -      -      -   1.78   1.78   1.81   1.81   1.81
satoh          Act. delay (ns)        -      -      -      -   5.93   6.55   6.99   6.99
               Area (GE)              -      -      -      -    438    409    385    385
               Power (A)              -      -      -      -   2.00   1.73   1.51   1.51
wolkerstorfer  Act. delay (ns)        -      -      -   4.93   5.94   6.48   7.51   7.51
               Area (GE)              -      -      -    625    412    415    392    392
               Power (A)              -      -      -   1.87   1.97   1.75   1.53   1.53
hw-lut         Act. delay (ns)     1.95   2.91   3.90   4.98   5.88   6.61   6.61   6.61
               Area (GE)           1545   1415   1351   1352   1302   1301   1301   1301
               Power (A)           1.18   0.97   1.00   0.97   0.93   1.00   1.00   1.00
sub16-lut      Act. delay (ns)        -   2.94   3.92   4.46   4.46   4.46   4.46   4.46
               Area (GE)              -   2040   1979   1957   1957   1957   1957   1957
               Power (A)              -   0.56   0.53   0.55   0.58   0.58   0.58   0.58
hybrid-lut     Act. delay (ns)        -   2.93   3.92   4.86   5.83   6.49   6.49   6.49
               Area (GE)              -   1222    840    810    799    798    798    798
               Power (A)              -   1.34   1.02   0.98   0.95   0.98   0.98   0.98
bertoni        Act. delay (ns)     1.86   2.90   3.31   3.31   3.31   3.31   3.31   3.31
               Area (GE)           2016   1433   1399   1399   1399   1399   1399   1399
               Power (A)           0.42   0.30   0.27   0.27   0.27   0.27   0.27   0.27
bertoni-2stg   Act. delay (ns)     1.98   2.79   3.53   3.26   3.26   3.26   3.26   3.26
               Area (GE)           1941   1446   1436   1421   1421   1421   1421   1421
               Power (A)           0.42   0.32   0.31   0.33   0.33   0.33   0.33   0.33
Table 1: Synthesis results of the eight S-box designs depending on the target delay (a dash marks an empty cell)
We extracted a Spectre netlist from the layout using Assura RCX, where we only considered
resistors larger than 1 Ω and capacitors larger than 1 pF. In contrast to our previous work [18], we
obtained the power consumption of the different S-box designs through simulation with Synopsys
NanoSim. All simulations were performed with BSIM3v3 transistor models characterized for the
UMC L250 technology and the built-in NanoSim models for resistors and capacitors. The results of
the NanoSim simulations shown in Table 1 represent the mean current consumption of the S-boxes
at a supply voltage of 2.5 V. We used a clock frequency of 50 MHz (i.e. new input values are
applied to the circuit with a period of 20 ns) and simulated all 256 possible input patterns.
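For readers who want to convert the reported mean current into power or energy, the relations below may be useful. The symbols $\bar{I}$ (mean current), $\bar{P}$ (mean power), $T$ (input period), and $E_{\mathrm{eval}}$ (energy per S-box evaluation) are introduced here for illustration and do not appear in the original measurements; the numerical values for the supply voltage and the period are those stated above.

```latex
\bar{P} = V_{DD} \cdot \bar{I}, \qquad V_{DD} = 2.5\,\mathrm{V}
\\
E_{\mathrm{eval}} = \bar{P} \cdot T = V_{DD} \cdot \bar{I} \cdot T, \qquad T = \frac{1}{50\,\mathrm{MHz}} = 20\,\mathrm{ns}
```

Since the supply voltage and the input period are identical for all designs, ratios of mean current between designs equal the corresponding ratios of power and of energy per evaluation.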
6 Experimental Results
We synthesized all eight S-box implementations mentioned in Section 4 using the design flow
described previously. For each implementation several synthesis runs were carried out, whereby
we specied different target values for the maximum critical path delay, ranging from 2 ns to
9 ns. Table 1 summarizes the actual delay, the area of the synthesized design, and the mean power
consumption. We omitted the results of all synthesis runs where the timing constraints were not
met, i.e. when the actual delay was higher than the target delay.
[Plot: area (GE) over the target value for the critical path delay (ns) for sub16-lut, bertoni, bertoni-2stg, hw-lut, hybrid-lut, satoh, wolkerstorfer, and canright]
Figure 2: Area vs. critical path delay
Figure 2 shows the area of the eight S-box designs when synthesized for a specific critical path
delay. The area is given in gate equivalents (GE), calculated as total area divided by the size of a
2-input NAND with the lowest drive strength, which is the NAND20 cell of the library we used.
Amongst the three calculating implementations (at the bottom of the figure), canright is clearly
the best. It has the smallest size of all eight S-boxes, but suffers from a longer critical path than the
hardware look-up implementations and the low-power solutions. The calculating implementations
are smaller than the other two approaches because they make use of the algebraic structure of the
S-box to implement the substitution. On the other hand, this structure has a relatively long critical
path. The shortest critical path can be achieved with bertoni, but its size is about three times that
of canright. Look-up implementations ignore the algebraic structure of the S-box and just aim at a
straightforward realization of the boolean equations given by the input-output relation. Hence, the
synthesizer has a much higher degree of freedom for optimizing the circuit, which allows for a
shorter critical path at the expense of silicon area.
The low-power implementations also ignore the algebraic properties of the substitution and
simply implement the boolean equations of the input-output relation. However, they use a specific
structure (decode-permute-encode) to reduce signal activity. Although the critical path is similarly
short as for look-up implementations, the one-hot encoding requires more silicon area than the
look-up implementations. The sub16-lut approach also has a significant area overhead introduced
by the address decoding of the sub-tables, which makes it the most costly solution in terms of silicon
area. Moreover, the address decode logic causes a longer critical path. As expected, the compromise
between hardware look-up and calculation (hybrid-lut) lies somewhere between hw-lut and the
calculating implementations with regard to both critical path delay and area.
Figure 3 shows the total power consumption plotted against the critical path delay. All power
values are normalized with respect to the power consumption of hw-lut for a delay of 5.0 ns. The
low-power S-boxes based on the approach of Bertoni (bertoni, bertoni-2stg) are the clear winners
of this comparison. The original implementation bertoni shows the best overall results among all
eight examined designs, closely followed by the modified version bertoni-2stg. Bertoni's approach
is solely directed towards low power consumption with a minimal level of signal activity in the
circuit. The sub16-lut approach, on the other hand, tries to improve a straightforward look-up table
implementation (hw-lut) with low-power measures. However, sub16-lut requires almost twice as
much power as bertoni, while hw-lut consumes about three times more power. The hybrid-lut
approach requires roughly the same amount of power as hw-lut.
[Plot: total power (normalized) over the target value for the critical path delay (ns) for satoh, wolkerstorfer, canright, hybrid-lut, hw-lut, sub16-lut, bertoni-2stg, and bertoni]
Figure 3: Total power consumption vs. critical path delay
The power consumption of the calculating implementations is much higher than that of the
low-power and look-up versions. The algebraic evaluation of the S-box function in calculating
implementations causes a large number of internal nodes to transition even if only a few input bits
toggle. This behavior entails high signal activity and, in turn, high power consumption. In look-up
implementations a change of a few input bits affects the evaluation of all output bits separately. As
normally some output bits will remain unchanged, the signal activity within this particular path
is low, which limits the overall power consumption. The implementation of canright consumes
almost twice as much power as hw-lut, and roughly an order of magnitude more power than
bertoni. The other two calculating implementations, wolkerstorfer and satoh, have similar power
characteristics as canright.
[Plot: normalized power-area product over the target value for the critical path delay (ns) for satoh, wolkerstorfer, canright, hybrid-lut, sub16-lut, hw-lut, bertoni, and bertoni-2stg]
Figure 4: Power-area product vs. critical path delay
Figure 4 shows the results of the eight S-box implementations in terms of the power-area
product. This metric is particularly relevant for applications with a need for both small silicon area
and low power consumption, e.g. cryptographically enhanced RFID tags or sensor nodes.
Due to their large area requirements, hw-lut and sub16-lut have the worst power-area prod-
uct among all eight examined implementations. Also the calculating S-boxes show a relatively
bad power-area product, which is mainly caused by the high power consumption of the S-box
evaluation. All three calculating implementations have similar characteristics for relaxed critical
path conditions. Both satoh and wolkerstorfer also have similar properties for more stringent
constraints on the critical path, whereas canright becomes more and more advantageous for faster
designs. The hybrid-lut implementation is even slightly better than canright when synthesized for
a delay of 5 ns. However, hybrid-lut becomes very unattractive if the critical path delay needs to be
smaller. The low-power approach of bertoni achieves the best overall power-area product, closely
followed by bertoni-2stg.
The power-area products shown in Figure 4 differ from those in [18] because we used a different
standard cell library and a different approach for evaluating the power consumption. According to
our results, the calculating implementations are more attractive than the look-up implementations
and sub16-lut is the best look-up implementation for short critical paths. The low-power designs
achieve the best results for the power-area product in our study as well as in [18]. However, while
our study found slight advantages for bertoni, the results in [18] show bertoni-2stg as the winner.
[Figure 5 plot: normalized total power (y-axis, 0 to 1.2) vs. area in GE (x-axis, 0 to 2100) for satoh, wolkerstorfer, canright, hybrid-lut, hw-lut, bertoni-2stg, bertoni, and sub16-lut; an arrow marks the direction of decreasing critical path delay]
Figure 5: Total power consumption vs. area
Figure 5 illustrates the power consumption in relation to the required silicon area. In general,
the points further away from the point of origin represent synthesis results for shorter critical
path delays. The figure shows that calculating implementations tend to sacrifice power efficiency
to achieve higher speed. On the other hand, the low-power implementations trade silicon area
for a shorter critical path. The sub16-lut implementation shows similar behavior. The look-up
implementations hw-lut and hybrid-lut sacrifice area as well as power efficiency to roughly the
same degree.
In order to minimize the critical path delay, the synthesizer applies a number of optimization
techniques like using standard cells with higher drive strengths or the duplication of logic paths,
which causes considerable power consumption in circuits with high switching activity. Calculating
S-box implementations have an inherently high number of signal switches and, therefore, incur a
disproportionate increase in power consumption when reducing the critical path delay. Low-power
implementations, on the other hand, are characterized by little signal activity and, therefore, a
moderate increase in power consumption for shorter critical paths.
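This effect can be made concrete with the common first-order model of dynamic CMOS power, P = alpha * C * V_DD^2 * f, where alpha is the switching activity and C the switched capacitance. The sketch below is only a simplified illustration under that model (it ignores short-circuit and leakage power) and is unrelated to the simulation flow used for the results in this paper. All parameter values are hypothetical.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order dynamic power in watts: alpha * C * V_DD^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def power_increase(activity, cap_before_f, cap_after_f, vdd_v, freq_hz):
    """Extra dynamic power caused by growing the switched capacitance,
    e.g. through cells with higher drive strength or duplicated logic paths."""
    before = dynamic_power(activity, cap_before_f, vdd_v, freq_hz)
    after = dynamic_power(activity, cap_after_f, vdd_v, freq_hz)
    return after - before


if __name__ == "__main__":
    # Hypothetical values: the same capacitance growth (1.0 pF -> 1.5 pF) applied
    # to a high-activity circuit and to a low-activity circuit.
    high = power_increase(0.5, 1.0e-12, 1.5e-12, 2.5, 100e6)
    low = power_increase(0.1, 1.0e-12, 1.5e-12, 2.5, 100e6)
    print(f"high-activity increase: {high:.3e} W")
    print(f"low-activity increase:  {low:.3e} W")
```

Under this model the additional power scales linearly with the switching activity, so the same timing-driven growth in capacitance costs a high-activity circuit proportionally more absolute power than a low-activity one.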
When compared to the results reported in [18] (which are based on a 0.35 µm standard-cell
library), the silicon area and critical path delay figures correspond quite well to the current ones
obtained with the UMC 0.25 µm technology. Regarding power consumption, we notice that the
current figures indicate a less dramatic difference among the examined S-box implementations than
those given in [18]. We attribute this discrepancy to the different standard cell libraries and the
different power evaluation methods. While the results in [18] were obtained via estimations from
the synthesis tool, our current figures result from a much more accurate simulation of the placed
and routed designs using NanoSim. This, of course, has also led to slight differences in all other
metrics which include the power consumption results.
7 Conclusions
In this paper we examined eight AES S-box implementations which follow three different design
strategies. We analyzed and compared various cost metrics like critical path delay, silicon area, and
power consumption of these implementations based on synthesis runs with a 0.25 µm CMOS
standard cell library. Our simulation results clearly show that the characteristics of the eight S-box
implementations differ significantly. For example, the power consumption of the different S-boxes
varies by almost an order of magnitude, which underpins the importance of selecting the proper
S-box with respect to the requirements of the target application. We found that Canright's S-box
design is the best choice for applications where small silicon area is the main criterion (e.g. RFID
tags). Bertoni's S-box is very well suited for applications with a demand for low power or energy
consumption, e.g. wireless sensor nodes. In addition, the Bertoni S-box also has the shortest critical
path, followed by the look-up implementations. While the results for the calculating implementations
only apply to the AES S-box, the insights from the other two implementation strategies
(look-up except hybrid-lut and low-power) are also useful for other cryptographic S-boxes.
Acknowledgements
The authors would like to thank Johannes Wolkerstorfer and David Canright for providing the
HDL source code of several AES S-box implementations. The research described in this paper has
been supported by the Austrian Science Fund (FWF) under grant P16952N04, the FIT-IT initiative
of the Austrian Federal Ministry of Transport, Innovation, and Technology (project SNAP), and the
EPSRC under grant EP/E001556/1. The research described in this paper has also been supported, in
part, by the European Commission through the IST Programme under contract IST-2002-507932
ECRYPT. The information in this document reflects only the authors' views, is provided "as is" and
no guarantee or warranty is given that the information is fit for any particular purpose. The user
thereof uses the information at its sole risk and liability.
References
[1] G. Bertoni, M. Macchetti, L. Negri, and P. Fragneto. Power-efficient ASIC synthesis of cryptographic
Sboxes. In Proceedings of the 14th ACM Great Lakes Symposium on VLSI (GLSVLSI 2004),
pp. 277–281. ACM Press, 2004.
[2] D. Canright. A very compact S-Box for AES. In Cryptographic Hardware and Embedded Systems –
CHES 2005, vol. 3659 of Lecture Notes in Computer Science, pp. 441–455. Springer Verlag, 2005.
[3] P. Chodowiec and K. Gaj. Very compact FPGA implementation of the AES algorithm. In
Cryptographic Hardware and Embedded Systems – CHES 2003, vol. 2779 of Lecture Notes in Computer
Science, pp. 319–333. Springer Verlag, 2003.
[4] J. Daemen and V. Rijmen. The Design of Rijndael: AES – The Advanced Encryption Standard. Springer
Verlag, 2002.
[5] M. Feldhofer, K. Lemke, E. Oswald, F.-X. Standaert, T. Wollinger, and J. Wolkerstorfer. State of the Art
in Hardware Architectures. ECRYPT deliverable D.VAM.2, available for download at
http://www.ecrypt.eu.org/documents/D.VAM.2-1.0.pdf, Sept. 2005.
[6] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen. AES implementation on a grain of sand. IEE
Proceedings – Information Security, 152(1):13–20, Oct. 2005.
[7] A. Hodjat, D. D. Hwang, B.-C. Lai, K. Tiri, and I. M. Verbauwhede. A 3.84 Gbits/s AES crypto
co-processor with modes of operation in a 0.18-µm CMOS technology. In Proceedings of the 15th ACM
Great Lakes Symposium on VLSI (GLSVLSI 2005), pp. 351–356. ACM Press, 2005.
[8] H. Li. A parallel S-box architecture for AES byte substitution. In Proceedings of the 2nd International
Conference on Communications, Circuits and Systems (ICCCAS 2004), vol. 1, pp. 1–3. IEEE, 2004.
[9] R. Lidl and H. Niederreiter. Finite Fields, vol. 20 of Encyclopedia of Mathematics and Its Applications.
Cambridge University Press, 1996.
[10] M. Macchetti and G. Bertoni. Hardware implementation of the Rijndael SBOX: A case study. ST
Journal of System Research, 0(0):84–91, July 2003.
[11] M. McLoone and J. V. McCanny. High performance single-chip FPGA Rijndael algorithm
implementations. In Cryptographic Hardware and Embedded Systems – CHES 2001, vol. 2162 of Lecture
Notes in Computer Science, pp. 65–76. Springer Verlag, 2001.
[12] N. Mentens, L. Batina, B. Preneel, and I. M. Verbauwhede. Systematic evaluation of compact hardware
implementations for the Rijndael S-box. In Topics in Cryptology – CT-RSA 2005, vol. 3376 of Lecture
Notes in Computer Science, pp. 323–333. Springer Verlag, 2005.
[13] S. Morioka and A. Satoh. An optimized S-Box circuit architecture for low power AES design. In
Cryptographic Hardware and Embedded Systems – CHES 2002, vol. 2523 of Lecture Notes in Computer
Science, pp. 172–186. Springer Verlag, 2002.
[14] National Institute of Standards and Technology (NIST). Data Encryption Standard (DES). Federal
Information Processing Standards (FIPS) Publication 46-3, Oct. 1999.
[15] National Institute of Standards and Technology (NIST). Advanced Encryption Standard (AES). Federal
Information Processing Standards (FIPS) Publication 197, Nov. 2001.
[16] N. Pramstaller and J. Wolkerstorfer. A universal and efficient AES co-processor for field programmable
logic arrays. In Field Programmable Logic and Application – FPL 2004, vol. 3203 of Lecture Notes
in Computer Science, pp. 565–574. Springer Verlag, 2004.
[17] A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A compact Rijndael hardware architecture with
S-Box optimization. In Advances in Cryptology – ASIACRYPT 2001, vol. 2248 of Lecture Notes in
Computer Science, pp. 239–254. Springer Verlag, 2001.
[18] S. Tillich, M. Feldhofer, and J. Großschädl. Area, delay, and power characteristics of standard-cell
implementations of the AES S-box. In Embedded Computer Systems: Architectures, Modeling, and
Simulation – SAMOS 2006, vol. 4017 of Lecture Notes in Computer Science, pp. 457–466. Springer
Verlag, 2006.
[19] S. Tillich and J. Großschädl. Instruction set extensions for efficient AES implementation on 32-bit
processors. In Cryptographic Hardware and Embedded Systems – CHES 2006, vol. 4249 of Lecture
Notes in Computer Science, pp. 270–284. Springer Verlag, 2006.
[20] J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC implementation of the AES SBoxes. In
Topics in Cryptology – CT-RSA 2002, vol. 2271 of Lecture Notes in Computer Science, pp. 67–78.
Springer Verlag, 2002.
[21] X. Zhang and K. K. Parhi. High-speed VLSI architectures for the AES algorithm. IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, 12(9):957–967, Sept. 2004.
