Processor
Yongzhi Fu, Lin Hao and Xuejie Zhang
Yunnan University,
Department of Computer Science and Engineering
CuiHu Road No.2, Kunming 650091, China
fuyongzhi@ynu.edu.cn
Rujin Yang
Yunnan Telecom Netit Group
Beijing Road No.605, Kunming 650224, China
yrj@ynnetit.com
Abstract
In this paper, we present our implementation of a counter mode AES processor on the Xilinx Virtex2 FPGA platform. We studied different techniques for implementing the AES Rijndael algorithm in reconfigurable hardware and chose the most suitable method to further optimize the structure of the cipher. This results in a clock frequency of 212.5 MHz, which translates to a throughput of 27.1 Gb/s, the highest throughput reported so far. We also compare the operation modes of AES in terms of security and efficiency.
1 Introduction
In 1997, NIST initiated the development of the Advanced Encryption Standard (AES), the new American block cipher standard designed to replace the long-serving DES. The main design goal of AES is to achieve at least the security level of Triple DES while improving performance on both software and hardware platforms. After three rounds of evaluation and selection over about four years, Rijndael, a block cipher proposed by two Belgian cryptographers, was selected as the official AES standard[1]. The evaluation of the candidate algorithms focused mostly on security and performance. Performance tests were carried out on both software and hardware platforms; the software evaluations were carried out on 8-bit and 32-bit processors. Many specialized hardware architectures were proposed during and after the selection of the AES algorithm. This work falls into two categories: ASIC implementations[2, 3] and reconfigurable logic implementations[4, 5, 6, 7].
ASIC implementations have the advantage of a fully optimized structure, which results in smaller circuit area, higher operating speed, and lower power consumption. However, ASIC design and implementation is complex, time consuming, and very costly. An ASIC cannot be modified once it has been fabricated, so it cannot adapt to a frequently changing environment. Most designs have therefore been carried out on reconfigurable platforms. Reconfigurable platforms use FPGA technology, which combines the high speed of specialized hardware with the agility of software, and they cost much less than ASIC implementations.
Most of the above designs operate in Electronic Codebook (ECB) mode. ECB mode encrypts identical plaintext blocks into identical ciphertext blocks. These implementations[2, 3, 4, 5, 6] therefore reveal some pattern information about the plaintext and should not be used under some conditions. The Cipher Block Chaining (CBC), Cipher Feedback (CFB), and Output Feedback (OFB) modes have better security properties than ECB, but the encryption of each block depends on feedback from the encipherment of the previous block. This dependency prevents the use of pipelining, which would otherwise encrypt many different blocks simultaneously, so the speed of CBC, CFB, and OFB cannot reach that of ECB[7]. Our design is based on the newly developed Counter (CTR) mode[8, 9], which avoids the security flaw of ECB and has no dependencies between different blocks. The operations can thus be fully pipelined to achieve extremely high performance. We studied different techniques for implementing the AES Rijndael algorithm in reconfigurable hardware and chose the most suitable method to further optimize the structure of the cipher. The optimizations we used include loop unrolling and inner-round, outer-round, and mixed pipelining. Our tests and experiments were carried out on Xilinx Virtex2 FPGAs. The fully mixed inner- and outer-round pipelined architecture achieved a clock frequency of 212.5 MHz, which translates to a throughput of 27.1 Gb/s, the highest throughput reported so far.
Proceedings of the Second International Conference on Embedded Software and Systems (ICESS05)
0-7695-2512-1/05 $20.00 2005 IEEE
2 AES-Rijndael Algorithm and Its Reconfigurable Implementation
Rijndael is a block cipher with variable key length and block length. The AES standard specifies a block size of 128 bits and key sizes of 128, 192, and 256 bits. Our implementation covers only the 128-bit key version, because the other two versions have the same structure; the only differences are the number of rounds performed and the number of round keys needed.
2.1 AES General Architecture
Rijndael is a substitution-permutation network (SP-network) cipher whose design is based on Galois field operations. The cipher consists of four basic operations: SubBytes, the S-box substitution; ShiftRows, a permutation by horizontal rotation; MixColumns, a matrix multiplication over GF(2^8); and AddRoundKeys, a simple exclusive OR operation. The transformation of the plaintext into the ciphertext takes N_r rounds, where N_r depends on the key length. For a 128-bit key, N_r equals 10. All rounds have an identical structure except the last one, which omits MixColumns. One way to construct the cipher is to implement the two kinds of round separately, see Fig.1a. We did not choose this architecture because it costs more resources. Our implementation of the general architecture has a switch inserted between MixColumns and AddRoundKeys, see Fig.1b. The switch takes as inputs the outputs of ShiftRows and MixColumns. When the round counter reaches N_r, it connects AddRoundKeys to ShiftRows; for full rounds it connects AddRoundKeys to MixColumns. In this way, the full round and the last round are implemented in a single circuit.
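The shared-round structure with its switch can be sketched in software as follows. This is only an illustrative model, not the hardware description used in our design: the function name encrypt_iterative and the tracing stand-ins are hypothetical, and the four operations are passed in as parameters so that the control flow can be shown on its own.

```python
NR = 10  # number of rounds for a 128-bit key


def encrypt_iterative(state, round_keys, sub_bytes, shift_rows,
                      mix_columns, add_round_key):
    """Iterate one shared round circuit NR times (the structure of Fig.1b)."""
    state = add_round_key(state, round_keys[0])  # initial key addition
    for rnd in range(1, NR + 1):
        shifted = shift_rows(sub_bytes(state))
        mixed = mix_columns(shifted)
        # The switch: in the last round AddRoundKeys is fed from ShiftRows,
        # in every other round it is fed from MixColumns.
        selected = shifted if rnd == NR else mixed
        state = add_round_key(selected, round_keys[rnd])
    return state


def trace(name):
    """Hypothetical stand-in (not a real AES operation): records its name."""
    return lambda s, *key: s + [name]


log = encrypt_iterative([], [None] * (NR + 1),
                        trace("SB"), trace("SR"), trace("MC"), trace("ARK"))
print(log.count("MC"), log.count("ARK"), log[-3:])
```

With the tracing stand-ins, the log shows MixColumns on the data path in every round except the last, while a single loop body serves both kinds of round.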
Figure 1. General Iterative Structure of AES (a. Two Independent Round Implementation; b. Two Round in One Architecture)
2.2 Basic Operations of AES
SubBytes involves a large S-box substitution. The input and output of the S-box are both 8 bits wide, so an S-box contains 256 entries of 8 bits each. It consumes a lot of resources, because the S-boxes for the 16 byte fields of a 128-bit block operate in parallel within a round, which requires 16 copies of the identical S-box. Many efforts have been made to optimize the implementation of the AES S-box. The simplest way is to map it directly into the LUTs of the FPGA. Because LUTs usually take only 4 inputs, implementing an 8×8 S-box requires multiple stages of LUTs and results in a higher area cost. Chodowiec et al. proposed using the embedded block RAMs of the FPGA to implement the S-boxes[7]. Each S-box needs 256 × 8 bit = 2 Kbit of RAM, so every 4 Kbit block RAM can be configured as two S-boxes. This cuts the delay to a single table look-up and saves logic resources. Unfortunately, block RAMs are only available in some specific FPGAs. The method lacks generality, so we decided not to use it. Another way to optimize the S-box is to compute its content directly from its mathematical form, which combines the multiplicative inverse in GF(2^8) with an affine transformation over GF(2)[10]. Wolkerstorfer et al. proposed a technique for building smaller S-boxes by expressing the elements of the single field GF(2^8) as polynomials over the smaller field GF(2^4)[11]. In this way the resource requirement can be cut down, but the delay of the circuit increases[3]. In order to achieve higher speed, we did not adopt this technique. Our implementation of SubBytes therefore relies on the first method, direct mapping into LUTs.
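The mathematical form of the S-box mentioned above can be written out as a short software sketch. It is meant only to show where the 256 table entries come from; the helper names gf_mul and build_sbox are hypothetical, and the brute-force inverse search is chosen for clarity rather than efficiency.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES field polynomial
        b >>= 1
    return result


def build_sbox():
    """Build the AES S-box: multiplicative inverse, then affine transform."""
    sbox = []
    for x in range(256):
        # 0 has no inverse and is mapped to 0 by convention.
        inv = 0 if x == 0 else next(y for y in range(1, 256)
                                    if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            # XOR in the inverse rotated left by 1, 2, 3 and 4 bits.
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
print(hex(SBOX[0x53]))
```

A table generated this way is what the LUT-based and block-RAM-based approaches store; the composite-field approach instead evaluates the inverse in logic.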
ShiftRows rotates each of the 4 rows of the block, each row being 32 bits wide. The rotation is a fixed permutation of the 32 bits, which can be implemented simply by configuring the routing resources and requires no logic resources at all. The 4 rotations are performed in parallel and are very efficient.
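Because the rotation amounts are fixed, ShiftRows reduces to a constant index map, which is why it costs only wiring. The sketch below assumes the byte ordering of the AES standard (byte index = 4 × column + row, with row r rotated left by r bytes); the names SHIFT_ROWS_MAP and shift_rows are illustrative.

```python
# Output byte i is wired to input byte SHIFT_ROWS_MAP[i].
# Bytes are indexed as 4 * column + row; row r is rotated left by r bytes.
SHIFT_ROWS_MAP = [4 * ((col + row) % 4) + row
                  for col in range(4) for row in range(4)]


def shift_rows(state):
    """Apply ShiftRows to a 16-byte state as a pure permutation."""
    return [state[src] for src in SHIFT_ROWS_MAP]


print(shift_rows(list(range(16))))
```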
MixColumns is the most complex operation in AES and results in the longest critical path among the operations. MixColumns involves multiplication and addition over GF(2^8). It can be expressed as a matrix multiplication in the Galois field GF(2^8):

\[
\begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{bmatrix}
\]
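The matrix product above needs only multiplications by 02 and 03, and 03·a = 02·a XOR a, so each output byte is a handful of XORs plus one conditional reduction. A minimal software sketch for a single column follows; the function names xtime and mix_column are illustrative and do not describe our circuit.

```python
def xtime(a):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def mix_column(a):
    """Multiply one 4-byte column [A0, A1, A2, A3] by the MixColumns matrix."""
    out = []
    for r in range(4):
        two = xtime(a[r])                                  # 02 * A_r
        three = xtime(a[(r + 1) % 4]) ^ a[(r + 1) % 4]     # 03 * A_(r+1)
        out.append(two ^ three ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```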
Figure 6. Performance of the 5 Different Architectures
efficiency in the case that the cipher contains many small rounds. If the cipher is composed of a few large rounds, outer-round pipelining is not as efficient as it is for ciphers with small rounds. Chodowiec et al. proposed a technique known as inner-round pipelining to solve the large-round problem[13]. Inner-round pipelining cuts one cipher round into pipeline stages by inserting registers inside the round, see Fig.5. This method incurs little resource overhead compared with the general architecture. The implementation is difficult, however, because balancing the stages within the cipher round is not an easy task and the clock frequency is determined by the longest stage. Our implementation cuts the AES cipher round into four pipeline stages, with each stage corresponding to one of the basic operations. Dividing the stages further is possible but difficult, because the longest path, the S-box substitution, is hard to divide. The experimental results show that for the inner-round pipelined structure the throughput increases by 26% and the resource consumption rises by 14%. To achieve extremely high throughput, we combined inner- and outer-round pipelining to form a mixed pipelined architecture. This architecture has many short stages and can encrypt one block of data per clock cycle. It reaches a frequency of 212.5 MHz, which translates to a throughput of 27.1 Gb/s, the highest reported for AES so far. Its resource requirement is also the highest, 17887 slices, which requires 4 XC2V1000 FPGAs to implement the full cipher. We also evaluated the efficiency of each architecture as the ratio of the highest achievable throughput in Mb/s to the logic resources (in slices) needed to implement it. As Fig.8 shows, the mixed pipelining architecture is the most efficient, with a ratio of 1.52, and the unrolled architecture is the least efficient, with a ratio of 0.12.
Figure 7. Resource Consumption
Figure 8. Efficiency of the Five Architectures
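The relation between clock frequency, throughput, and the efficiency ratio can be checked with a few lines of arithmetic. The sketch assumes an ideal pipeline that accepts one 128-bit block every clock cycle; the function names are illustrative. Under that assumption 212.5 MHz gives about 27.2 Gb/s, in line with the reported 27.1 Gb/s, and the reported throughput divided by 17887 slices gives the ratio of about 1.52.

```python
BLOCK_BITS = 128  # AES block size


def throughput_gbps(clock_hz, blocks_per_cycle=1):
    """Ideal throughput of a pipeline that accepts blocks every clock cycle."""
    return clock_hz * blocks_per_cycle * BLOCK_BITS / 1e9


def efficiency(throughput_mbps, slices):
    """Throughput in Mb/s per slice of logic."""
    return throughput_mbps / slices


ideal = throughput_gbps(212.5e6)      # ideal figure for the mixed pipeline
ratio = efficiency(27100, 17887)      # reported throughput and slice count
print(ideal, round(ratio, 2))
```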
4 AES Counter Mode Operation and Security Consideration
Counter mode avoids the problem of ECB, which reveals pattern information about the plaintext. It works by encrypting a counter value with the key K and XORing the output with the plaintext to obtain the ciphertext. The procedure is shown in Fig.9. Decryption uses the same process to recover the plaintext from the ciphertext. Since only the forward cipher is needed, implementing the inverse cipher is unnecessary. With an identical structure for encipherment and decipherment, the implementation of counter mode ciphers is simplified. The transformation of a counter value has no dependency on previous outputs, so pipelining can be fully used. Counter mode also has the advantage of requiring no padding, which ECB, CBC, and CFB modes need when the size of the data is not a multiple of the block length. Padding can incur a large penalty in network applications when the Maximum Transfer Unit is reached. When the ciphertext contains errors, CBC and CFB modes propagate the error to the following blocks, whereas counter mode confines the error to the affected block. Therefore, for extremely high throughput implementations, counter mode is the most suitable.
Figure 9. Counter Mode Operations
Counter mode has a special security requirement: the same counter value and key must not be used to encrypt more than one block of data. If that happens, XORing the two ciphertexts yields the XOR of the two plaintexts, which reveals plaintext information. In particular, if one of the plaintexts is already known, the other can easily be recovered by XORing the known plaintext with the XOR of the two ciphertexts. To avoid this problem, we divide the 128-bit counter into 3 parts, as shown in Fig.10. The counter consists of a 40-bit cipher ID, a 48-bit key counter, and a 40-bit block counter. Each cipher has its own cipher ID, and the space of cipher IDs is abundant. The key counter increases whenever the key is updated. Even if the key is updated a thousand times a minute, it would take well over thirty thousand years to use up the key counter space, which is sufficient for our applications. Under each key, up to 16 TB can be encrypted without refreshing the key. If the block counter is exhausted, the key counter is increased to avoid using the same key with the same counter value. In this way, our design guarantees that no key and counter value pair is used more than once.
Figure 10. Counter Value Generation
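The counter layout and the counter mode data path can be illustrated with a short software sketch. Several things here are assumptions made for illustration only: the field order within the 128-bit word (cipher ID in the most significant bits), the function names, and above all stand_in_cipher, which uses truncated SHA-256 in place of the AES forward cipher so that the example stays self-contained. It is not AES and not our hardware.

```python
import hashlib


def counter_block(cipher_id, key_counter, block_counter):
    """Pack 40-bit cipher ID | 48-bit key counter | 40-bit block counter."""
    assert cipher_id < 1 << 40
    assert key_counter < 1 << 48
    assert block_counter < 1 << 40
    value = (cipher_id << 88) | (key_counter << 40) | block_counter
    return value.to_bytes(16, "big")


def stand_in_cipher(key, block):
    """Placeholder for the AES forward cipher (NOT AES, illustration only)."""
    return hashlib.sha256(key + block).digest()[:16]


def ctr_crypt(key, cipher_id, key_counter, data, cipher=stand_in_cipher):
    """Encrypt or decrypt: XOR data with the encrypted counter values."""
    out = bytearray()
    for offset in range(0, len(data), 16):
        keystream = cipher(key, counter_block(cipher_id, key_counter,
                                              offset // 16))
        # zip() stops at the shorter input, so a short last block
        # needs no padding.
        out += bytes(d ^ k for d, k in zip(data[offset:offset + 16],
                                           keystream))
    return bytes(out)


BYTES_PER_KEY = (1 << 40) * 16                        # 2^44 bytes per key
YEARS_OF_KEYS = (1 << 48) / 1000 / (60 * 24 * 365)    # 1000 key updates/min

message = b"counter mode needs no padding"
ciphertext = ctr_crypt(b"k" * 16, 1, 0, message)
print(len(ciphertext), ctr_crypt(b"k" * 16, 1, 0, ciphertext) == message)
```

The same function serves for both directions, each block depends only on its own counter value, and the output has exactly the length of the input. The two constants reproduce the capacity arithmetic: 2^40 blocks of 16 bytes give 16 TB per key, and 2^48 keys at a thousand updates a minute last on the order of 5 × 10^5 years, comfortably above the figure quoted in the text.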
5 Related Works
Many groups have reported implementations of the AES algorithm in ASICs[2, 3, 11] and FPGAs[4, 5, 6, 7, 12]. The authors of [3, 7] implemented Rijndael in a general iterative architecture that can be applied in both feedback and non-feedback modes. Nearly all the implementations make use of pipelining techniques[2, 4, 5, 6, 12]. The authors of [12, 13, 14] made comparative studies of inner-round, outer-round, and mixed pipelining over a set of block cipher algorithms. The aforementioned implementations are all based on ECB or CBC mode, whereas our implementation is based on counter mode. Counter mode overcomes the security flaw of ECB mode. It also removes the dependency between the encryption of a data block and the previous output, which exists in the feedback modes CBC, CFB, and OFB and prevents them from being pipelined. Our fully mixed inner- and outer-round pipelined architecture achieves a throughput of 27.1 Gb/s, the highest among these reports. This is the result of both the optimized architecture and the more advanced FPGA chips used, which reduce circuit delay.
6 Summary
In this paper, an extremely high performance counter mode AES processor designed on Xilinx Virtex2 FPGAs has been presented. We explored several techniques for implementing and optimizing the AES-Rijndael algorithm on reconfigurable platforms. Among these techniques, the architecture with full inner- and outer-round pipelining achieved the highest throughput, 27.1 Gb/s. Our future work will extend the design to an application in the IPsec ESP protocol. The recently published Internet standard[15] on encrypting IP packets with the AES algorithm in counter mode makes our counter mode cipher an attractive solution.
7 Acknowledgement
This work is supported by Project 60573104 of the National Natural Science Foundation of China.
References
[1] Federal Information Processing Standards Publication
197. Advanced Encryption Standard (AES), National
Institute of Standards and Technology, 2001.
[2] Ichikawa, T., Kasuya, T., Matsui, M. Hardware Eval-
uation of the AES Finalists. Proc. 3rd Advanced En-
cryption Standard (AES) Candidate Conference, New
York, April 13-14, 2000.
[3] Verbauwhede, I., Schaumont, P., Kuo, H. Design and
Performance Testing of a 2.29-GB/s Rijndael Pro-
cessor, IEEE Journal of Solid-State Circuits. Vol.38,
No.3, March 2003.
[4] Gaj, K., Chodowiec, P. Comparison of the Hardware
Performance of the AES Candidates Using Reconfigurable
Hardware, Proc. 3rd Advanced Encryption
Standard (AES) Candidate Conference, New York,
April 13-14, 2000.
[5] Elbirt, A., Yip, W., Chetwynd, B., Paar, C. An FPGA
Implementation and Performance Evaluation of the
AES Block Cipher Candidate Algorithm Finalists,
Proc. 3rd Advanced Encryption Standard (AES) Can-
didate Conference, New York, April 13-14, 2000.
[6] Weaver, N., Wawrzynek, J. A comparison of the
AES candidates amenability to FPGA Implementa-
tion, Proc. 3rd Advanced Encryption Standard (AES)
Candidate Conference, New York, April 13-14, 2000.
[7] Chodowiec, P., Gaj, K., Bellows, P., Schott, B.
Experimental Testing of the Gigabit IPSec-Compliant
Implementation of Rijndael and Triple DES Using
SLAAC-1V FPGA Accelerator Board. Proc. Information
Security Conference, Malaga, Spain, October 2001.
[8] Lipmaa, H., Rogaway, P., Wagner, D. CTR-Mode En-
cryption, Public Workshop on Symmetric Key Block
Cipher Modes of Operation, Baltimore, MD, October
2000.
[9] Dworkin, M. Recommendation for Block Cipher
Modes of Operation: Methods and Techniques. NIST
Special Publication 800-38A, December 2001.
[10] Daemen, J., Rijmen, V. The Design of Rijndael:
AES - The Advanced Encryption Standard. New York:
Springer-Verlag, 2002.
[11] Wolkerstorfer, J., Oswald, E., Lamberger, M. An
ASIC Implementation of the AES S-boxes, Proc. RSA
Conf.2002, San Jose, CA, February, 2002. pp.67-78.
[12] Chodowiec, P., Khuon, P., Gaj, K. Fast
Implementations of Secret-Key Block Ciphers Using
Mixed Inner- and Outer-Round Pipelining. Proc.
ACM/SIGDA Ninth International Symposium on
Field Programmable Gate Arrays, FPGA'01,
Monterey, February 2001, pp.94-102.
[13] Chodowiec, P. Comparison of the Hardware
Performance of the AES Candidates Using Reconfigurable
Hardware, Master's Thesis, George Mason
University, 2002.
[14] Elbirt, A. Reconfigurable Computing for
Symmetric-Key Algorithms, Ph.D. Dissertation,
Worcester Polytechnic Institute, 2002.
[15] Housley, R. Using Advanced Encryption Standard
(AES) Counter Mode With IPsec Encapsulating Se-
curity Payload (ESP), RFC 3686, January 2004.