Cryptographic substitution boxes (s-boxes) are an integral part of modern block ciphers. This paper examines different standard-cell implementations of the AES S-box. Results show that the timing, area, and power properties of the different realizations can vary by up to almost an order of magnitude.
Area, Delay, and Power Characteristics of Standard-Cell
Implementations of the AES S-Box
Stefan Tillich, Martin Feldhofer, Thomas Popp
Institute for Applied Information Processing and Communications
Graz University of Technology, Inffeldgasse 16a, A-8010 Graz, Austria
{stillich,mfeldhof,tpopp}@iaik.tugraz.at

Johann Großschädl
University of Bristol, Department of Computer Science
Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, U.K.
johann.groszschaedl@cs.bris.ac.uk

Abstract

Cryptographic substitution boxes (S-boxes) are an integral part of modern block ciphers like the Advanced Encryption Standard (AES). There exists a rich literature devoted to the efficient implementation of cryptographic S-boxes, wherein hardware designs for FPGAs and standard cells have received particular attention. In this paper we present a comprehensive study of different standard-cell implementations of the AES S-box with respect to timing (i.e. critical path), silicon area, power consumption, and combinations of these cost metrics. We examine implementations which exploit the mathematical properties of the AES S-box, constructions based on hardware look-up tables, and dedicated low-power solutions. Our results show that the timing, area, and power properties of the different S-box realizations can vary by up to almost an order of magnitude. In terms of area and area-delay product, the best choices are implementations which calculate the S-box output. On the other hand, the hardware look-up solutions are characterized by the shortest critical path. The dedicated low-power implementations not only reduce power consumption by a large degree, but also show good timing properties and offer the best power-delay and power-area products.

1 Introduction

The Internet of the 21st century will consist of billions of non-traditional computing systems like cell phones, PDAs, sensor nodes, and other mobile devices (gadgets) with wireless networking capability. Wireless networking, along with the fact that many of these devices (e.g.
sensor nodes) are easily accessible, has raised a number of security concerns. Sophisticated security protocols, in combination with well-established cryptographic primitives, can ensure privacy and integrity of communication over insecure networks. Consequently, there is an increasing demand to implement cryptographic algorithms on resource-limited embedded devices like cell phones, PDAs, or sensor nodes. Even some extremely constrained systems like Radio Frequency Identification (RFID) tags are required to perform cryptographic operations.

The Advanced Encryption Standard (AES), which was announced by NIST in 2001, defines one of the most important symmetric ciphers for the long-term future [15]. The AES algorithm is a variant of the Rijndael cipher [4] and can be implemented efficiently in both hardware and software. Common AES hardware implementations take the form of stand-alone ASICs and cryptographic coprocessors for system-on-chip (SoC) integration. In addition, hardware/software co-design techniques like extending the instruction set of a general-purpose processor have been investigated in the recent past [19].

(Journal of Signal Processing Systems, vol. 50, no. 2, pp. 251-261, Feb. 2008. Springer-Verlag, 2008. A preliminary version of this paper was published in the proceedings of SAMOS 2006, LNCS 4017, pp. 457-466.)

Due to the high performance of modern microprocessors, AES software can reach throughput rates that are sufficient for most applications. Therefore, hardware implementations of the AES algorithm are mainly important for high-end server systems with extreme performance requirements and for embedded devices with a demand for low power consumption and small silicon area.

Most of the published AES hardware designs focus on high speed and high throughput for implementation in FPGAs [3, 11, 16]. In addition, some ASIC implementations have been reported in the literature. For example, Hodjat et al.
developed a 3.84 Gbits/s AES crypto coprocessor with modes-of-operation support based on a 0.18 µm CMOS technology [7]. Their design features a 128-bit datapath and encrypts a block of data in 11 clock cycles. A completely different design approach is necessary when optimizing AES hardware for low power consumption or small silicon area. Feldhofer et al. introduced an AES implementation suited for passively-powered devices like RFID tags [6]. It comprises an 8-bit datapath which occupies an area of 3,595 gates (including registers and control logic) when synthesized using a 0.35 µm standard cell library. These results show that the AES algorithm allows for a wide range of trade-offs between performance, power consumption, and hardware cost [5].

Symmetric ciphers like the AES require non-linear functions in order to resist linear cryptanalysis. Substitution is a common function for introducing non-linearity. A substitution function, generally referred to as S-box, can be realized in the form of an arbitrary mapping from input bits to output bits (e.g. DES [14]) or via algebraic operations (e.g. AES). Different cipher algorithms use different numbers of S-boxes. For example, DES uses eight S-boxes which map six to four bits, while AES employs a single S-box which is a bijective mapping from eight to eight bits.

The AES algorithm makes use of its S-box in the SubBytes round transformation and the key expansion [4]. From a mathematical point of view, the AES S-box is defined as an inversion in the finite field $\mathbb{F}_{2^8}$ with a specific irreducible polynomial [9], followed by an affine transformation. The inverse S-box, which is required for the InvSubBytes round transformation for decryption, is simply the inverse of the affine transformation, followed by an inversion in $\mathbb{F}_{2^8}$.

The S-box is a costly and performance-critical building block of the AES algorithm.
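To make the algebraic definition above concrete, the following Python sketch computes the S-box and its inverse directly from that definition. The reduction polynomial $x^8 + x^4 + x^3 + x + 1$ (0x11B), the affine constant 0x63, and the bit rotations are taken from FIPS-197 [15] rather than from this paper, and all function names are illustrative. It is a bit-serial software model for checking values, not one of the hardware designs evaluated here.

```python
# Software model of the AES S-box: inversion in GF(2^8), then an affine map.
# Constants follow FIPS-197; the helper names are hypothetical.

def gf256_mul(a, b):
    """Multiply two field elements modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:          # degree 8 reached: reduce
            a ^= 0x11B
    return result

def gf256_inv(a):
    """Return a^254, which is the inverse for a != 0 and maps 0 to 0."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a)
    return result

def rotl8(x, shift):
    """Rotate an 8-bit value to the left."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF

def sbox(x):
    """Forward S-box: field inversion followed by the affine transformation."""
    y = gf256_inv(x)
    return y ^ rotl8(y, 1) ^ rotl8(y, 2) ^ rotl8(y, 3) ^ rotl8(y, 4) ^ 0x63

def inv_sbox(x):
    """Inverse S-box: inverse affine transformation followed by inversion."""
    y = rotl8(x, 1) ^ rotl8(x, 3) ^ rotl8(x, 6) ^ 0x05
    return gf256_inv(y)
```

Because both directions share the same field inversion and differ only in the affine part, a combined encryption/decryption S-box can reuse the inverter, which is the property the calculating hardware designs exploit.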
Results from previous work [7, 21] show that the S-box lies on the critical path of many AES architectures and, hence, limits the maximum clock frequency. In addition, the S-box also impacts area and power consumption of AES hardware [17, 13]. Therefore, the AES S-box has been a subject of intensive research in recent years, which has led to a rich literature on efficient S-box design and implementation. The proposed designs can be roughly categorized into S-boxes that contain optimized circuits for arithmetic in $\mathbb{F}_{2^8}$ [2, 17, 20], constructions using hardware look-up tables [8, 11], and dedicated low-power solutions [1], all of which have their specific advantages and disadvantages with respect to area, delay, and power consumption. Although most papers introducing new S-box designs provide implementation results and discuss related work, it is generally difficult to compare the different design approaches since, for example, the implementations may have been produced using different design flows and tools, different standard cell libraries, or different optimizations (speed, area) for the synthesis process.

In this paper we analyze and compare silicon area, critical path delay, and power consumption characteristics of the most common standard-cell designs of the AES S-box in a uniform and coherent way. We consider in our study designs which exploit the mathematical properties of the S-box, constructions based on hardware look-up tables, and dedicated low-power solutions. In contrast to our previous work [18], where we used a 0.35 µm standard-cell library to evaluate different S-box designs, we conducted the present study on the basis of a more modern 0.25 µm process technology in order to provide practical insights and results that are closer to the state-of-the-art in VLSI manufacturing. We put similar effort into optimizing each of the evaluated S-box designs to ensure a fair comparison.
Our results show that the area, delay, and power figures of the different S-box designs vary significantly (up to almost an order of magnitude), which underpins the importance of selecting the best-suited S-box with respect to the requirements of the application.

The remainder of this paper is organized as follows. Section 2 briefly explains the AES algorithm and discusses hardware implementation aspects. In Section 3 we overview different implementation strategies for the AES S-box. The particular S-box implementations that we used for our evaluation of area, delay, and power consumption are described in Section 4. Section 5 provides background information on the design flow and evaluation methodology. In Section 6 we discuss our experimental results and we finally conclude in Section 7.

2 The Advanced Encryption Standard

In November 2001, after several years of public evaluation, the National Institute of Standards and Technology (NIST) officially announced the algorithm for the new Federal Information Processing Standard FIPS-197 [15], also called Advanced Encryption Standard (AES). The block cipher Rijndael [4] was chosen from 15 submitted candidates and has since become the AES algorithm. The AES is a very flexible algorithm suitable for implementation on many platforms in software as well as in hardware. Its simplicity and symmetry properties facilitate optimization towards different objectives such as high performance or low cost.

The AES algorithm has a fixed block size of 128 bits. Each block is organized as a 4 × 4 matrix of bytes, referred to as State. The FIPS-197 standard defines three different key lengths: 128, 192, and 256 bits. Similar to most symmetric ciphers, the AES algorithm encrypts an input block by applying a round transformation several times. Depending on the key length, the number of rounds is either 10, 12, or 14. The round transformation modifies the 128-bit State from its initial value (i.e.
the plaintext) to obtain the ciphertext after the last round. Each round consists of non-linear, linear, and key-dependent transformations, which can all be described by means of algebraic operations over the finite field $\mathbb{F}_{2^8}$. These operations, called SubBytes, ShiftRows, MixColumns, and AddRoundKey, scramble the bytes of the State either individually, row-wise, or column-wise. Before the first round an initial AddRoundKey is performed, while in the last round the MixColumns operation is omitted.

The SubBytes transformation substitutes each byte of the State independently. This byte substitution is defined by the so-called S-box, which can be expressed through arithmetic operations in the finite fields $\mathbb{F}_2$ and $\mathbb{F}_{2^8}$. More specifically, it is composed of an inversion in $\mathbb{F}_{2^8}$ followed by an affine transformation. The affine transformation consists of a multiplication with a constant polynomial over $\mathbb{F}_2$ and addition of another constant polynomial. The SubBytes transformation is the only non-linear function of the AES algorithm. Its implementation has a major impact on the area, performance, and power consumption of an AES hardware module.

ShiftRows rotates each row of the State to the left using a specific offset. The offset equals the row index (starting at 0), which means that the first row is not rotated at all and the last row is rotated by three bytes to the left.

MixColumns operates on columns of the State. Each column is interpreted as a polynomial of degree 3 with coefficients from the field $\mathbb{F}_{2^8}$. This polynomial is multiplied by a polynomial with fixed coefficients, and the result is reduced modulo $g(t) = \{1\}t^4 + \{1\}$ (where $\{1\} \in \mathbb{F}_{2^8}$). The MixColumns operation is often expressed as a multiplication of a constant 4 × 4 matrix of $\mathbb{F}_{2^8}$ elements with the input column (interpreted as four elements of $\mathbb{F}_{2^8}$), yielding the respective output column.
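The ShiftRows and MixColumns descriptions above can be illustrated with a short Python sketch. The constant matrix with entries {02}, {03}, {01}, {01} and the reduction polynomial 0x11B come from FIPS-197 [15], not from this paper; storing the State as a list of four rows and the function names are assumptions made for this example.

```python
# Sketch of ShiftRows and MixColumns on a State stored as state[row][column].
# Matrix constants follow FIPS-197; the names below are hypothetical.

MIX_MATRIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]

def gf256_mul(a, b):
    """Multiply two field elements modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

def shift_rows(state):
    """Rotate row r of the State to the left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def mix_single_column(column):
    """Multiply the constant matrix with one 4-byte column."""
    out = []
    for matrix_row in MIX_MATRIX:
        acc = 0
        for coeff, byte in zip(matrix_row, column):
            acc ^= gf256_mul(coeff, byte)
        out.append(acc)
    return out

def mix_columns(state):
    """Apply mix_single_column to each of the four columns of the State."""
    columns = [mix_single_column([state[r][c] for r in range(4)])
               for c in range(4)]
    return [[columns[c][r] for c in range(4)] for r in range(4)]
```

Both operations are linear over $\mathbb{F}_2$, which is why only SubBytes contributes non-linearity to the round.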
The three aforementioned transformations form the substitution permutation network of the AES algorithm, wherein SubBytes represents the substitution part (to increase confusion) and ShiftRows and MixColumns constitute the permutation part (increasing diffusion).

AddRoundKey simply combines the State with a round key by applying an XOR operation over all 128 bits. The KeySchedule transformation produces the 128-bit round keys, whereby the first round key is equal to the cipher key. All other round keys are computed from the previous round key by using the S-box functionality and some constants referred to as Rcon.

The decryption function recovers the plaintext from a given ciphertext by executing the inverse round transformations (InvSubBytes, InvShiftRows, InvMixColumns, and AddRoundKey) in reverse order. All round keys are also used in reverse order.

2.1 Hardware Implementation Aspects

The AES is a flexible algorithm well suited for implementation in hardware. A multitude of hardware architectures are possible, which allows for optimization toward different requirements, ranging from high performance to low power consumption and small silicon area. A considerable literature exists that is devoted to efficient hardware implementation of the AES [3, 6, 7, 11, 16, 17, 21]. Depending on the target application, AES architectures can have a datapath width of between 8 and 128 bits. Additionally, it is possible to unroll several rounds and insert pipeline stages into the design. However, to support different modes of operation like the CBC mode [4], often only one round is realized in hardware and used repeatedly.

The width of the datapath determines the main characteristics (i.e. performance, area, power consumption) of an AES implementation. Since the AES is byte-oriented, an 8-bit architecture with a single S-box is the natural choice for applications where small area and low power dissipation are crucial, e.g. smart cards or RFID tags.
At the other end of the spectrum are 128-bit architectures containing 16 S-boxes to compute the SubBytes function of a 128-bit data block in one pass. Due to this massive parallelism, 128-bit architectures can reach high throughput rates at the expense of large silicon area. 32-bit architectures with four S-boxes constitute a good compromise between the two aforementioned extremes; they allow for much higher performance than 8-bit architectures but demand only a fraction of the area of 128-bit implementations.

3 Implementation Strategies for the AES S-Box

All AES architectures sketched in the previous section have a common feature in that the SubBytes transformation occupies a significant portion of the overall silicon area. The size of SubBytes is, in turn, determined by the number of S-boxes and their concrete implementation. Various implementation options for the AES S-box have been investigated in the recent past, which has led to an abundant literature [1, 2, 8, 10, 12, 13, 17, 20].

The SubBytes transformation substitutes all 16 bytes of the State independently using the S-box. Furthermore, the S-box is also used in the AES key expansion. In software, the S-box is typically realized in the form of a look-up table since inversion in the finite field $\mathbb{F}_{2^8}$ cannot be calculated efficiently on general-purpose processors. In hardware, on the other hand, the implementation of the S-box is directed by the desired trade-off between area, delay, and power consumption.

The most obvious implementation approach for the S-box takes the form of hardware look-up tables [11]. However, since encryption and decryption require different tables, and each table contains 2048 bits, the overall hardware cost of this approach is relatively high. An implementation option related to standard cells is the use of ROM compilers to produce hardware macros. For the technology that we used, a sufficiently large ROM would require a considerable amount of silicon area.
The critical path delay would be similar to a hardware look-up approach, but the power consumption of generated ROMs is about two to three orders of magnitude higher (see footnote 1). Therefore, we do not consider the implementation of the S-box as ROM in this paper.

More sophisticated approaches calculate the S-box function in hardware using its algebraic properties [4]. The focus of such implementations is the efficient realization of the inversion in $\mathbb{F}_{2^8}$, which can be achieved by decomposing the finite field into the sub-fields $\mathbb{F}_{2^4}$ and $\mathbb{F}_{2^2}$. An inversion in a finite field of characteristic 2 can be carried out in different ways, depending on the basis which is used to represent the field elements [9]. The two most common types of bases for $\mathbb{F}_{2^m}$ are the polynomial basis and the normal basis. A polynomial basis is a basis of the form $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ where $\alpha$ is a root of an irreducible polynomial $p(t)$ of degree $m$ with coefficients from $\mathbb{F}_2$. On the other hand, a normal basis can be found by selecting a field element $\beta \in \mathbb{F}_{2^m}$ such that the elements of the set $\{\beta, \beta^2, \beta^4, \ldots, \beta^{2^{m-1}}\}$ are linearly independent.

A third approach for implementing the AES S-box was proposed by Bertoni et al. in [1]. By using an intermediate one-hot encoding of the input, arbitrary logic functions (including cryptographic S-boxes) can be realized with minimal power consumption. The main drawback of this approach is that it results in relatively large silicon area.

4 Implementation Details

All AES S-box implementations analyzed in this paper can perform forward and inverse byte substitution for encryption and decryption, respectively. We implemented the S-boxes either from scratch or obtained the HDL descriptions from the authors of the respective publications. The implementations examined consist solely of combinatorial logic, i.e. no pipeline stages have been inserted since pipelining does not make sense when a feedback mode of operation like OFB or CBC is used [7].
In the following we describe a total of eight different implementations of the AES S-box which can be grouped into three basic categories: look-up implementations, calculating implementations, and low-power implementations. Four of the eight S-box implementations are illustrated in Figure 1.

[Figure 1: Comparison of four S-box implementations (hw-lut, sub16-lut, bertoni, hybrid-lut)]

The simplest design in our comparison is a straightforward implementation of a hardware look-up table [11]. The synthesizer transforms the behavioral description of the look-up table into a mass of unstructured standard cells. This approach will be denoted as hw-lut. A modification of hw-lut is to use sub-tables in order to minimize switching activity in the look-up tables to reduce power consumption. We examined such solutions with sub-tables of size 16, 32, 64, 128, and 256 bytes, but in this paper we only specify results for size 16 (sub16-lut).

Footnote 1: Unfortunately, the exact performance figures for ROMs were not accessible for the technology we used.

Implementations which calculate the S-box transformation in hardware were first proposed by Wolkerstorfer et al. [20] and Satoh et al. [17]. The former approach decomposes the elements of $\mathbb{F}_{2^8}$ into polynomials over the sub-field $\mathbb{F}_{2^4}$ and performs inversion there. Our implementation of this solution is denoted as wolkerstorfer. Satoh's solution decomposes the field elements further into polynomials over the sub-field $\mathbb{F}_{2^2}$, where inversion is a trivial swap of the lower and higher bit of the representation. This implementation is referred to as satoh in the following. Both of these approaches represent the field elements by using a polynomial basis.
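The idea of moving the inversion into a sub-field can be sketched in Python as follows. This is a generic illustration of inversion in a quadratic extension of $\mathbb{F}_{2^4}$, not the circuit of Wolkerstorfer et al. or Satoh et al.: the sub-field polynomial $t^4 + t + 1$, the extension polynomial $x^2 + x + \lambda$, the search for $\lambda$, and all names are assumptions of this example, and the basis change between the AES representation and the composite field is left out. An element is a pair (a_h, a_l) standing for $a_h x + a_l$.

```python
# Inversion in a 256-element field built as a degree-2 extension of GF(2^4).
# Field polynomials and names are assumptions made for this sketch.

def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo t^4 + t + 1."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return result

def gf16_inv(a):
    """Return a^14, the inverse for a != 0 (0 is mapped to 0)."""
    result = 1
    for _ in range(14):
        result = gf16_mul(result, a)
    return result

# Pick a constant so that x^2 + x + LAMBDA has no root in GF(2^4),
# i.e. the polynomial is irreducible over the sub-field.
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(r, r) ^ r ^ lam for r in range(16)))

def composite_mul(a, b):
    """Multiply (a_h x + a_l)(b_h x + b_l) using x^2 = x + LAMBDA."""
    ah, al = a
    bh, bl = b
    hh = gf16_mul(ah, bh)
    high = hh ^ gf16_mul(ah, bl) ^ gf16_mul(al, bh)
    low = gf16_mul(hh, LAMBDA) ^ gf16_mul(al, bl)
    return (high, low)

def composite_inv(a):
    """Invert a_h x + a_l with a single inversion in GF(2^4)."""
    ah, al = a
    # d is the norm of the element and lies in the sub-field.
    d = gf16_mul(gf16_mul(ah, ah), LAMBDA) ^ gf16_mul(ah, al) ^ gf16_mul(al, al)
    d_inv = gf16_inv(d)
    return (gf16_mul(ah, d_inv), gf16_mul(ah ^ al, d_inv))
```

The 8-bit inversion is thereby reduced to a handful of 4-bit multiplications and one 4-bit inversion; applying the same step once more to the 4-bit field corresponds to the further decomposition into $\mathbb{F}_{2^2}$ mentioned above.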
Canright improved the calculation of the S-box by switching the representation to a normal basis [2]. As in Satoh's solution, the elements of $\mathbb{F}_{2^8}$ are mapped to a polynomial over the sub-field $\mathbb{F}_{2^2}$. This approach will be denoted as canright.

A compromise between hardware look-up and calculation has also been examined. In this implementation (denoted as hybrid-lut) only the inversion in $\mathbb{F}_{2^8}$ is realized as a look-up table. Since the inversion is used for both encryption and decryption, the size of the look-up table is halved in relation to the hw-lut approach. The affine and inverse affine transformations are performed via logic circuits just as in the calculating implementations of wolkerstorfer, satoh, and canright.

The low-power approach of Bertoni et al. [1] uses a decode stage to convert the eight bits of the input byte and the control bit which selects encryption or decryption into a one-hot representation consisting of $2^9 = 512$ bits. The substitution itself is just a rearrangement of these bits and can be done efficiently in hardware by a rewiring of lines as illustrated in Figure 1. Since two of the lines always map to the same 8-bit result (one for encryption and one for decryption), these line pairs can be combined with a logical OR to yield a one-hot decoded representation of the result consisting of 256 bits. A subsequent encoder stage transforms this result back to an 8-bit binary value. Due to this decoder-permute-encoder structure, there is only very little signal activity within the circuit when the input changes, resulting in low power consumption.

Note that the structure of Bertoni's approach makes it easy to introduce pipeline stages. However, it may be necessary to add a large number of additional flip-flops when the pipeline stage is placed between the decoder and encoder, i.e. on the one-hot encoded signal lines. These flip-flops will increase power consumption considerably and can easily negate the low-power advantages of this solution.
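The decoder-permute-encoder structure described above can be modeled behaviorally in a few lines of Python. The sketch only mirrors the data flow (512 one-hot lines, fixed wiring, pairwise OR, 256-to-8 encoder); it says nothing about the four-stage or two-stage decoder circuits or their power behavior. The S-box table is generated with constants from FIPS-197 [15], and all names are hypothetical.

```python
# Behavioral model of a one-hot (decode / permute / encode) S-box.
# Table constants follow FIPS-197; function names are illustrative.

def _gf256_mul(a, b):
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

def _sbox_entry(x):
    inv = 1
    for _ in range(254):               # x^254 is the inverse (0 maps to 0)
        inv = _gf256_mul(inv, x)
    out = inv
    for shift in (1, 2, 3, 4):         # affine transformation
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out ^ 0x63

SBOX = [_sbox_entry(x) for x in range(256)]
INV_SBOX = [0] * 256
for _x, _y in enumerate(SBOX):
    INV_SBOX[_y] = _x

# Fixed wiring: lines 0..255 carry encryption inputs, 256..511 decryption.
WIRING = SBOX + INV_SBOX

def decode(select, value):
    """Turn the control bit and the input byte into 512 one-hot lines."""
    lines = [0] * 512
    lines[(select << 8) | value] = 1
    return lines

def permute_and_merge(lines):
    """Route every line to its target and OR the two lines per target."""
    merged = [0] * 256
    for index, bit in enumerate(lines):
        merged[WIRING[index]] |= bit
    return merged

def encode(one_hot):
    """Convert 256 one-hot lines back into an 8-bit value."""
    value = 0
    for index, bit in enumerate(one_hot):
        if bit:
            value |= index
    return value

def one_hot_sbox(value, decrypt=False):
    return encode(permute_and_merge(decode(int(decrypt), value)))
```

For any input, exactly one of the 512 decoded lines and one of the 256 merged lines is active, which is the source of the low signal activity the text refers to.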
For design scenarios where both power consumption and silicon area are of minor importance, Bertoni's approach can offer the best opportunity for reaching very high clock frequencies.

We tested two implementations of Bertoni's approach: One implementation uses a decoder with four stages as proposed in the original publication for minimal power consumption (bertoni). The second implementation, denoted as bertoni-2stg, uses a different decoder structure with only two stages in order to reduce the critical path of the circuit.

In the remainder of this paper we will refer to wolkerstorfer, satoh, and canright as calculating implementations. We will denote hw-lut and hybrid-lut as look-up implementations, and sub16-lut, bertoni, and bertoni-2stg as low-power implementations.

5 Design Flow and Evaluation Methodology

In contrast to our previous work [18], where we used a 0.35 µm standard cell library from Austriamicrosystems, all results in this paper were obtained with the VST250 standard cells from Virtual Silicon. These standard cells are built upon the 0.25 µm process technology L250 of UMC, which provides one poly-silicon layer and five metal layers. The nominal supply voltage of the VST250 cell library is 2.5 V.

We implemented the eight S-box designs described in Section 4 in VHDL according to the specifications in the respective papers. In order to ensure a fair comparison and a common interface for all implementations, we provided the input and output of each S-box with 8-bit registers. The integration of the registers made it possible to optimize for area and delay during synthesis. The logic synthesis was done using the Physically Knowledgeable Synthesis (PKS) tool from Cadence. We varied the constraints for the delay time (i.e. maximum clock frequency) from the minimum value to a value where the constraints could just be met. The delays given in Table 1 are the actual delays of the synthesized circuit.
Empty cells in the table (shown as "-") indicate that the respective target delay could not be achieved by the synthesizer.

After synthesis, the placement and routing of the standard cells was performed with the Cadence tool First Encounter. We did not include I/O cells in the designs, i.e. we analyzed only the core of the S-boxes consisting of standard cells and the power supply rings. During placement we used an area utilization of 70%. All the figures in Table 1 are results from synthesis excluding the clock tree for the input and output registers. After the routing step we integrated the layouts of the standard cells into the design, which gave us the full layout in GDS2 format.

Table 1: Synthesis results of the eight S-box designs depending on the target delay

Design          Result             Target delay (ns)
                                   2.00   3.00   4.00   5.00   6.00   7.00   8.00   9.00
canright        Act. delay (ns)       -      -      -   4.98   5.00   6.55   6.55   6.55
                Area (GE)             -      -      -    496    400    303    303    303
                Power (µA)            -      -      -   1.78   1.78   1.81   1.81   1.81
satoh           Act. delay (ns)       -      -      -      -   5.93   6.55   6.99   6.99
                Area (GE)             -      -      -      -    438    409    385    385
                Power (µA)            -      -      -      -   2.00   1.73   1.51   1.51
wolkerstorfer   Act. delay (ns)       -      -      -   4.93   5.94   6.48   7.51   7.51
                Area (GE)             -      -      -    625    412    415    392    392
                Power (µA)            -      -      -   1.87   1.97   1.75   1.53   1.53
hw-lut          Act. delay (ns)    1.95   2.91   3.90   4.98   5.88   6.61   6.61   6.61
                Area (GE)          1545   1415   1351   1352   1302   1301   1301   1301
                Power (µA)         1.18   0.97   1.00   0.97   0.93   1.00   1.00   1.00
sub16-lut       Act. delay (ns)       -   2.94   3.92   4.46   4.46   4.46   4.46   4.46
                Area (GE)             -   2040   1979   1957   1957   1957   1957   1957
                Power (µA)            -   0.56   0.53   0.55   0.58   0.58   0.58   0.58
hybrid-lut      Act. delay (ns)       -   2.93   3.92   4.86   5.83   6.49   6.49   6.49
                Area (GE)             -   1222    840    810    799    798    798    798
                Power (µA)            -   1.34   1.02   0.98   0.95   0.98   0.98   0.98
bertoni         Act. delay (ns)    1.86   2.90   3.31   3.31   3.31   3.31   3.31   3.31
                Area (GE)          2016   1433   1399   1399   1399   1399   1399   1399
                Power (µA)         0.42   0.30   0.27   0.27   0.27   0.27   0.27   0.27
bertoni-2stg    Act. delay (ns)    1.98   2.79   3.53   3.26   3.26   3.26   3.26   3.26
                Area (GE)          1941   1446   1436   1421   1421   1421   1421   1421
                Power (µA)         0.42   0.32   0.31   0.33   0.33   0.33   0.33   0.33

We extracted a Spectre netlist from the layout using Assura RCX, where we only considered resistors larger than 1 Ω and capacitors larger than 1 pF. In contrast to our previous work [18], we obtained the power consumption of the different S-box designs through simulation with Synopsys NanoSim. All simulations were performed with BSIM3v3 transistor models characterized for the UMC L250 technology and the built-in NanoSim models for resistors and capacitors. The results of the NanoSim simulations shown in Table 1 represent the mean current consumption of the S-boxes at a supply voltage of 2.5 V. We used a clock frequency of 50 MHz (i.e. new input values are applied to the circuit with a period of 20 ns) and simulated all 256 possible input patterns.

6 Experimental Results

We synthesized all eight S-box implementations mentioned in Section 4 using the design flow described previously. For each implementation several synthesis runs were carried out, whereby we specified different target values for the maximum critical path delay, ranging from 2 ns to 9 ns. Table 1 summarizes the actual delay, the area of the synthesized design, and the mean power consumption. We omitted the results of all synthesis runs where the timing constraints were not met, i.e. when the actual delay was higher than the target delay.

[Figure 2: Area vs. critical path delay. x-axis: target value for critical path delay (ns); y-axis: area (GE); one curve each for sub16-lut, bertoni, bertoni-2stg, hw-lut, hybrid-lut, satoh, wolkerstorfer, canright]

Figure 2 shows the area of the eight S-box designs when synthesized for a specific critical path delay. The area is given in gate equivalents (GE), calculated as total area divided by the size of a 2-input NAND with the lowest drive strength, which is the NAND20 cell of the library we used. Amongst the three calculating implementations (at the bottom of the figure), canright is clearly the best. It has the smallest size of all eight S-boxes, but suffers from a longer critical path than the hardware look-up implementations and the low-power solutions. The calculating implementations are smaller than the other two approaches because they make use of the algebraic structure of the S-box to implement the substitution. On the other hand, this structure has a relatively long critical path. The shortest critical path can be achieved with bertoni, but its size is about three times that of canright.

Look-up implementations ignore the algebraic structure of the S-box and just aim at a straightforward realization of the Boolean equations given by the input-output relation. Hence, the synthesizer has a much higher degree of freedom for optimizing the circuit, which allows for a shorter critical path at the expense of silicon area.

The low-power implementations also ignore the algebraic properties of the substitution and simply implement the Boolean equations of the input-output relation. However, they use a specific structure (decode-permute-encode) to reduce signal activity. Although the critical path is about as short as for look-up implementations, the one-hot encoding requires more silicon area than the look-up implementations. The sub16-lut approach also has a significant area overhead introduced by the address decoding of the sub-tables, which makes it the most costly solution in terms of silicon area.
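As a worked example of the area and area-delay comparison, the short Python sketch below ranks the eight designs using the actual delay and area values from the 9.00 ns target column of Table 1 (the most relaxed constraint, for which every design has an entry). The choice of this column and the helper names are assumptions of the example; other columns can give different rankings, in particular for delay.

```python
# Area and area-delay product from the 9.00 ns target column of Table 1.
# Each entry is (actual delay in ns, area in GE).

RELAXED = {
    "canright": (6.55, 303),
    "satoh": (6.99, 385),
    "wolkerstorfer": (7.51, 392),
    "hw-lut": (6.61, 1301),
    "sub16-lut": (4.46, 1957),
    "hybrid-lut": (6.49, 798),
    "bertoni": (3.31, 1399),
    "bertoni-2stg": (3.26, 1421),
}

def area_delay_product(delay_ns, area_ge):
    """Area-delay product in GE * ns."""
    return delay_ns * area_ge

def rank_by(metric):
    """Return the design names sorted from best (smallest) to worst."""
    return sorted(RELAXED, key=lambda name: metric(*RELAXED[name]))

by_area = rank_by(lambda delay_ns, area_ge: area_ge)
by_area_delay = rank_by(area_delay_product)

if __name__ == "__main__":
    for name in by_area_delay:
        delay_ns, area_ge = RELAXED[name]
        print(f"{name:14s} {area_ge:5d} GE  {delay_ns:5.2f} ns  "
              f"{area_delay_product(delay_ns, area_ge):8.1f} GE*ns")
```

For this column the three calculating implementations occupy the first three places in both rankings, in line with the statement in the abstract about area and area-delay product.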
Moreover, the address decode logic causes a longer critical path. As expected, the compromise between hardware look-up and calculation (hybrid-lut) lies somewhere between hw-lut and the calculating implementations with regard to both critical path delay and area. Figure 3 shows the total power consumption plotted against the critical path delay. All power values are normalized with respect to the power consumption of hw-lut for a delay of 5.0 ns. The low-power S-boxes based on the approach of Bertoni (bertoni, bertoni-2stg) are the clear winners of this comparison. The original implementation bertoni shows the best overall results among all eight examined designs, closely followed by the modied version bertoni-2stg. Bertonis approach is solely directed towards low power consumption with a minimal level of signal activity in the circuit. The sub16-lut approach, on the other hand, tries to improve a straightforward look-up table 0 0,25 0,5 0,75 1 1,25 1,5 1 2 3 4 5 6 7 8 9 Target value for critical path delay (ns) T o t a l
p o w e r
( n o r m a l i z e d ) satoh wolkerstorfer canright hybrid-lut hw-lut sub16-lut bertoni-2stg bertoni Figure 3: Total power consumption vs. critical path delay implementation (hw-lut) with low-power measures. However, sub16-lut requires almost twice as much power as bertoni, while hw-lut consumes about three times more power. The hybrid-lut approach requires roughly the same amount of power as hw-lut. The power consumption of the calculating implementations is much higher than that of the low-power and look-up versions. The algebraic evaluation of the S-box function in calculating implementations causes a large number of internal nodes to transition even if only a few input bits toggle. This behavior entails high signal activity and, in turn, high power consumption. In look-up implementations a change of a few input bits affects the evaluation of all output bits separately. As normally some output bits will remain unchanged, the signal activity within this particular path is low, which limits the overall power consumption. The implementation of canright consumes almost twice as much power as hw-lut, and roughly an order of magnitude more power than bertoni. The other two calculating implementations, wolkerstorfer and satoh, have similar power characteristics as canright. 0 250 500 750 1000 1250 1 2 3 4 5 6 7 8 9 Target value for critical path delay (ns) ( P o w e r
[Figure 4: Power-area product vs. critical path delay. x-axis: target value for critical path delay (ns); y-axis: (power x area), normalized; curves: satoh, wolkerstorfer, canright, hybrid-lut, sub16-lut, hw-lut, bertoni, bertoni-2stg.]

Figure 4 shows the results of the eight S-box implementations in terms of the power-area product. This metric is particularly relevant for applications with a need for both small silicon area and low power consumption, e.g. cryptographically enhanced RFID tags or sensor nodes. Due to their large area requirements, hw-lut and sub16-lut have the worst power-area product among all eight examined implementations. The calculating S-boxes also show a relatively poor power-area product, which is mainly caused by the high power consumption of the S-box evaluation. All three calculating implementations have similar characteristics for relaxed critical path conditions. Both satoh and wolkerstorfer also have similar properties for more stringent constraints on the critical path, whereas canright becomes more and more advantageous for faster designs. The hybrid-lut implementation is even slightly better than canright when synthesized for a delay of 5 ns. However, hybrid-lut becomes very unattractive if the critical path delay needs to be smaller. The low-power approach of bertoni achieves the best overall power-area product, closely followed by bertoni-2stg.

The power-area products shown in Figure 4 differ from those in [18] because we used a different standard cell library and a different approach for evaluating the power consumption. According to our results, the calculating implementations are more attractive than the look-up implementations, and sub16-lut is the best look-up implementation for short critical paths. The low-power designs achieve the best results for the power-area product in our study as well as in [18]. However, while our study found slight advantages for bertoni, the results in [18] show bertoni-2stg as the winner.
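The normalization and the power-area product used in this comparison can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the record type `SynthesisResult`, the helper names, and the design labels `design-a` and `design-b` are hypothetical, the numbers are placeholders rather than measurements from this study, and the product is assumed to be normalized power multiplied by area in gate equivalents (the text does not spell out the exact definition).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisResult:
    """One synthesis run of one S-box design (field names are hypothetical)."""
    design: str
    delay_ns: float  # target critical path delay in ns
    area_ge: float   # silicon area in gate equivalents (GE)
    power: float     # total power in any consistent unit


def normalized_power(results, ref_design="hw-lut", ref_delay_ns=5.0):
    """Divide every power value by the power of the reference run."""
    ref = next(r for r in results
               if r.design == ref_design and r.delay_ns == ref_delay_ns)
    return {(r.design, r.delay_ns): r.power / ref.power for r in results}


def power_area_product(results, ref_design="hw-lut", ref_delay_ns=5.0):
    """Assumed definition: normalized power multiplied by area in GE."""
    norm = normalized_power(results, ref_design, ref_delay_ns)
    return {(r.design, r.delay_ns): norm[(r.design, r.delay_ns)] * r.area_ge
            for r in results}


# Placeholder numbers for illustration only; NOT measurements from the paper.
runs = [
    SynthesisResult("hw-lut", 5.0, 1000.0, 200.0),
    SynthesisResult("design-a", 5.0, 400.0, 100.0),
    SynthesisResult("design-b", 5.0, 250.0, 400.0),
]

# List the runs from best (smallest) to worst power-area product.
for key, product in sorted(power_area_product(runs).items(),
                           key=lambda item: item[1]):
    print(key, product)
```

Keeping the reference run as an explicit parameter makes it easy to repeat the comparison with a different baseline, for example another design or another delay target.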
[Figure 5: Total power consumption vs. area. x-axis: area (GE); y-axis: total power (normalized); curves: satoh, wolkerstorfer, canright, hybrid-lut, hw-lut, bertoni-2stg, bertoni, sub16-lut; an arrow marks the direction of decreasing critical path delay.]

Figure 5 illustrates the power consumption in relation to the required silicon area. In general, the points further away from the origin represent synthesis results for shorter critical path delays. The figure shows that calculating implementations tend to sacrifice power efficiency to achieve higher speed. On the other hand, the low-power implementations trade silicon area for a shorter critical path. The sub16-lut implementation shows similar behavior. The look-up implementations hw-lut and hybrid-lut sacrifice area as well as power efficiency to roughly the same degree.

In order to minimize the critical path delay, the synthesizer applies a number of optimization techniques, such as using standard cells with higher drive strengths or duplicating logic paths, which cause considerable power consumption in circuits with high switching activity. Calculating S-box implementations have an inherently high number of signal switches and, therefore, incur an over-proportional increase in power consumption when the critical path delay is reduced. Low-power implementations, on the other hand, are characterized by little signal activity and, therefore, show only a moderate increase in power consumption for shorter critical paths.

When compared to the results reported in [18] (which are based on a 0.35 μm standard-cell library), the silicon area and critical path delay figures correspond quite well to the current ones obtained with the UMC 0.25 μm technology. Regarding power consumption, we notice that the current figures indicate a less dramatic difference among the examined S-box implementations than those given in [18]. We attribute this discrepancy to the different standard cell libraries and the different power evaluation methods.
While the results in [18] were obtained via estimations from the synthesis tool, our current figures result from a much more accurate simulation of the placed and routed designs using NanoSim. This, of course, has also led to slight differences in all other metrics which include the power consumption results.

7 Conclusions

In this paper we examined eight AES S-box implementations which follow three different design strategies. We analyzed and compared various cost metrics like critical path delay, silicon area, and power consumption of these implementations based on synthesis runs with a 0.25 μm CMOS standard cell library. Our simulation results clearly show that the characteristics of the eight S-box implementations differ significantly. For example, the power consumption of the different S-boxes varies by almost an order of magnitude, which underpins the importance of selecting the proper S-box with respect to the requirements of the target application. We found that Canright's S-box design is the best choice for applications where small silicon area is the main criterion (e.g. RFID tags). Bertoni's S-box is very well suited for applications with a demand for low power or energy consumption, e.g. wireless sensor nodes. In addition, the Bertoni S-box also has the shortest critical path, followed by the look-up implementations. While the results for the calculating implementations only apply to the AES S-box, the insights from the other two implementation strategies (look-up except hybrid-lut and low-power) are also useful for other cryptographic S-boxes.

Acknowledgements

The authors would like to thank Johannes Wolkerstorfer and David Canright for providing the HDL source code of several AES S-box implementations.
The research described in this paper has been supported by the Austrian Science Fund (FWF) under grant P16952N04, the FIT-IT initiative of the Austrian Federal Ministry of Transport, Innovation, and Technology (project SNAP), and the EPSRC under grant EP/E001556/1. The research described in this paper has also been supported, in part, by the European Commission through the IST Programme under contract IST-2002-507932 ECRYPT. The information in this document reflects only the authors' views, is provided as is, and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

References

[1] G. Bertoni, M. Macchetti, L. Negri, and P. Fragneto. Power-efficient ASIC synthesis of cryptographic Sboxes. In Proceedings of the 14th ACM Great Lakes Symposium on VLSI (GLSVLSI 2004), pp. 277-281. ACM Press, 2004.
[2] D. Canright. A very compact S-Box for AES. In Cryptographic Hardware and Embedded Systems - CHES 2005, vol. 3659 of Lecture Notes in Computer Science, pp. 441-455. Springer Verlag, 2005.
[3] P. Chodowiec and K. Gaj. Very compact FPGA implementation of the AES algorithm. In Cryptographic Hardware and Embedded Systems - CHES 2003, vol. 2779 of Lecture Notes in Computer Science, pp. 319-333. Springer Verlag, 2003.
[4] J. Daemen and V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. Springer Verlag, 2002.
[5] M. Feldhofer, K. Lemke, E. Oswald, F.-X. Standaert, T. Wollinger, and J. Wolkerstorfer. State of the Art in Hardware Architectures. ECRYPT deliverable D.VAM.2, available for download at http://www.ecrypt.eu.org/documents/D.VAM.2-1.0.pdf, Sept. 2005.
[6] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen. AES implementation on a grain of sand. IEE Proceedings - Information Security, 152(1):13-20, Oct. 2005.
[7] A. Hodjat, D. D. Hwang, B.-C. Lai, K. Tiri, and I. M. Verbauwhede. A 3.84 Gbits/s AES crypto coprocessor with modes of operation in a 0.18-μm CMOS technology. In Proceedings of the 15th ACM Great Lakes Symposium on VLSI (GLSVLSI 2005), pp. 351-356. ACM Press, 2005.
[8] H. Li. A parallel S-box architecture for AES byte substitution. In Proceedings of the 2nd International Conference on Communications, Circuits and Systems (ICCCAS 2004), vol. 1, pp. 1-3. IEEE, 2004.
[9] R. Lidl and H. Niederreiter. Finite Fields, vol. 20 of Encyclopedia of Mathematics and Its Applications. Cambridge University Press, 1996.
[10] M. Macchetti and G. Bertoni. Hardware implementation of the Rijndael SBOX: A case study. ST Journal of System Research, 0(0):84-91, July 2003.
[11] M. McLoone and J. V. McCanny. High performance single-chip FPGA Rijndael algorithm implementations. In Cryptographic Hardware and Embedded Systems - CHES 2001, vol. 2162 of Lecture Notes in Computer Science, pp. 65-76. Springer Verlag, 2001.
[12] N. Mentens, L. Batina, B. Preneel, and I. M. Verbauwhede. Systematic evaluation of compact hardware implementations for the Rijndael S-box. In Topics in Cryptology - CT-RSA 2005, vol. 3376 of Lecture Notes in Computer Science, pp. 323-333. Springer Verlag, 2005.
[13] S. Morioka and A. Satoh. An optimized S-Box circuit architecture for low power AES design. In Cryptographic Hardware and Embedded Systems - CHES 2002, vol. 2523 of Lecture Notes in Computer Science, pp. 172-186. Springer Verlag, 2002.
[14] National Institute of Standards and Technology (NIST). Data Encryption Standard (DES). Federal Information Processing Standards (FIPS) Publication 46-3, Oct. 1999.
[15] National Institute of Standards and Technology (NIST). Advanced Encryption Standard (AES). Federal Information Processing Standards (FIPS) Publication 197, Nov. 2001.
[16] N. Pramstaller and J. Wolkerstorfer. A universal and efficient AES co-processor for field programmable logic arrays. In Field Programmable Logic and Application - FPL 2004, vol. 3203 of Lecture Notes in Computer Science, pp. 565-574. Springer Verlag, 2004.
[17] A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A compact Rijndael hardware architecture with S-Box optimization. In Advances in Cryptology - ASIACRYPT 2001, vol. 2248 of Lecture Notes in Computer Science, pp. 239-254. Springer Verlag, 2001.
[18] S. Tillich, M. Feldhofer, and J. Großschädl. Area, delay, and power characteristics of standard-cell implementations of the AES S-box. In Embedded Computer Systems: Architectures, Modeling, and Simulation - SAMOS 2006, vol. 4017 of Lecture Notes in Computer Science, pp. 457-466. Springer Verlag, 2006.
[19] S. Tillich and J. Großschädl. Instruction set extensions for efficient AES implementation on 32-bit processors. In Cryptographic Hardware and Embedded Systems - CHES 2006, vol. 4249 of Lecture Notes in Computer Science, pp. 270-284. Springer Verlag, 2006.
[20] J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC implementation of the AES SBoxes. In Topics in Cryptology - CT-RSA 2002, vol. 2271 of Lecture Notes in Computer Science, pp. 67-78. Springer Verlag, 2002.
[21] X. Zhang and K. K. Parhi. High-speed VLSI architectures for the AES algorithm. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(9):957-967, Sept. 2004.