You are on page 1of 86

Masters Thesis

High-Speed RSA Implementation for FPGA Platforms


Thomas Wockinger

Institute for Applied Information Processing and Communications Graz University of Technology

under the guardiance of Ao.Univ.-Prof. Dipl.-Ing. Dr.techn. Karl-Christian Posch and Univ.-Ass. Dipl.-Ing. Dr.techn. Johannes Wolkerstorfer as the contributing advisor responsible Graz, January 2005

Abstract
Providing cryptographic infrastructure for thousands of users on a server, requires an enormous computing power. Therefore, it makes sense to use hardware accelerator cards. In this context, the thesis treats the implementation of RSA on FPGA devices. The entire design, from the software front-end to the hardware backend, is shown. With the implementation of software interfaces, for example JCE and JCA, existing programs do not need to be modied and can nevertheless prot from the performance. The basic operation for computing the RSA algorithm is modular exponentiation. It can be computed by repeated modular multiplication. There exist many hardware-based implementations of RSA on FPGA platforms. These are restricted to architectures operating not on the full word size, because the resulting hardware complexity for word sizes greater than 1024 bit is too big. Since the development of the FPGA devices allows implementing more complex circuits, today it is possible to implement also architectures with a higher radix. The presented multiplier architecture is operating at full word size and employs even higher-radix multiplication to speed-up operation. Since the needed FPGA resources for a high radix implementation are enormous, suitable measures need to be applied to achieve high clock rates. Therefore, it is necessary to tailor the architecture towards FPGA structures.

Kurzfassung
Um eine kryptographische Infrastruktur fr tausende Benutzer zentralisiert auf u einem Server zur Verfgung zu stellen, wird eine enorme Rechenleistung bentigt. u o Es ist daher sinnvoll, diese Server mit Hardwarebeschleunigerkarten zu entlasten. In diesem Zusammenhang behandelt die Diplomarbeit die Implementierung von RSA auf FPGAs. Es soll dabei der gesamte Entwurf vom Software-Frontend bis zum Hardware-Backend gezeigt werden. Durch die Implementierung von Softwareschnittstellen, wie zum Beispiel JCE und JCA, mssen bestehende Programme u nicht verndert werden und knnen dennoch von der Performance protieren. Die a o grundlegende mathematische Operation bei RSA ist die modulare Exponentitation. Diese kann wiederum durch eine wiederholte modulare Multiplikation berechnet werden. Es gibt bereits eine Vielzahl von hardwarebasierten RSA-Implementierungen fr FPGAs, diese operieren jedoch nicht auf der vollen Wortbreite. Der Grund u dafr ist der enorme Platzbedarf fr Wortbreiten jenseits von 1024. Da die Enu u twicklung der FPGAs stetig voranschreitet, ist es heute mglich, auch Architekturen o mit hherem Radix zu implementieren. Die gezeigte Architektur ist parametrisiero bar, damit unterschiedliche Wortbreiten und Radizes ohne Anderung der Hardwarebeschreibung synthetisiert werden knnen. Da der Flchenbedarf fr eine High o a u Radix-Implementierung enorm ist, sind geeignete Manahmen zu treen, um die Taktrate zu maximieren. Dazu ist es notwendig, optimierte Hardwarestrukturen eines FPGAs zu verwenden.

Acknowledgments
I would like to thank Hannes Wolkerstorfer who is an outstanding advisor. Besides his work he always took the time to discuss dierent design approaches, to answer my emails, and to let me prot from his experience. Special thanks to my parents for supporting me during the years of study. As well I would like to thank my girlfriend Susi for her understanding during the work on this thesis (ILU). I would also like to thank Klaus Schwarzenberger for the time spent correcting this thesis several times. Last but not least, I want to thank all my friends in Graz, especially from the great Eheim, for sharing their time with me.

Contents
List of Figures List of Tables List of Algorithms List of Abbreviations 1 Introduction 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Structure of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 Problem Analysis and Theoretical Background 2.1 Public-Key Cryptography . . . . . . . . . . . . . 2.1.1 Advantages of Public-Key Cryptography . 2.1.2 Security of the Private Key . . . . . . . . 2.2 RSA . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Generating Keys . . . . . . . . . . . . . . 2.2.2 Encryption . . . . . . . . . . . . . . . . . 2.2.3 Decryption . . . . . . . . . . . . . . . . . 2.3 Security of RSA . . . . . . . . . . . . . . . . . . . 2.3.1 The Selection of p and q . . . . . . . . . . 2.3.2 The Selection of e . . . . . . . . . . . . . 2.3.3 The Selection of d . . . . . . . . . . . . . 2.4 Signatures with RSA . . . . . . . . . . . . . . . . 2.5 Fast Modular Exponentiation . . . . . . . . . . . 2.6 Modular Multiplication . . . . . . . . . . . . . . 2.6.1 Montgomery Multiplication . . . . . . . . 2.6.2 Orups Optimization . . . . . . . . . . . . 2.7 Multiplication . . . . . . . . . . . . . . . . . . . . 2.8 High-Radix Multiplication . . . . . . . . . . . . . 2.8.1 Redundant Number Representation . . . 2.8.2 Modied Booths Recoding . . . . . . . . 2.8.3 Carry-Save Adders . . . . . . . . . . . . . i ii iii iv 1 1 2 2 4 4 5 5 5 6 6 6 7 7 8 8 8 8 10 11 13 16 17 19 19 21

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . . .

3 Design Methodology 3.1 Design Flow . . . . . . . . . . . . . . . . 3.1.1 State of the Art Implementations 3.1.2 High-Level Model . . . . . . . . . 3.1.3 Hardware Description Language 3.1.4 Simulation . . . . . . . . . . . . 3.1.5 Synthesis . . . . . . . . . . . . . 3.2 Software . . . . . . . . . . . . . . . . . . 3.2.1 The Layer Pattern . . . . . . . . 3.2.2 Software Architecture . . . . . . 3.3 Hardware . . . . . . . . . . . . . . . . . 3.3.1 Control Path and Data path . . 3.3.2 Synchronizing Clock Domains . . 3.3.3 Architecture of FPGAs . . . . . 3.3.4 Xilinx FPGAs . . . . . . . . . . 3.3.5 Designing for FPGAs . . . . . . 3.3.6 High Performance Optimizations 4 Implementation 4.1 Software . . . . . . . . . . . . . . . . . 4.1.1 JCE and JCA . . . . . . . . . . 4.1.2 JNI Interface . . . . . . . . . . 4.1.3 Device Driver . . . . . . . . . . 4.2 Hardware . . . . . . . . . . . . . . . . 4.2.1 RSA Chip . . . . . . . . . . . . 4.2.2 Control Unit . . . . . . . . . . 4.2.3 Montgomery Multiplier . . . . 4.2.4 XILINX Single-Port RAM . . . 4.2.5 Parallel-Loadable Shift-Register 4.2.6 PCI Interface . . . . . . . . . . 4.2.7 High-Radix Multiplier . . . . . 4.2.8 Wallace Tree . . . . . . . . . . 4.2.9 Modied Wallace Tree . . . . . 4.2.10 Ripple-Carry Adder . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

25 25 26 27 28 29 30 32 33 33 34 35 35 37 38 40 41 43 43 43 44 46 47 47 48 55 55 57 58 60 61 63 63

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

. . . . . . . . . . . . . . .

5 Results 66 5.1 Word Size and Radix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.2 Performance and Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.3 Comparison with other Implementations . . . . . . . . . . . . . . . . . . . . 70 6 Conclusion 72 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 Bibliography 74

List of Figures
2.1 2.2 2.3 2.4 2.5 2.6 2.7 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 3.10 3.11 3.12 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 Multiplication of two binary numbers in dot notation . . . Radix-4 multiplication in dot notation . . . . . . . . . . . Operand selection for radix-4 multiplication . . . . . . . . Radix-4 partial product generation with Booth recoding . Conversion of a ripple-carry adder into a carry-save adder 6-to-2 CSA reduction tree . . . . . . . . . . . . . . . . . . Radix-4 multiplication with carry-save adder . . . . . . . FPGA design ow . . . . . . . . . . . . . . . . . Comparison between VHDL and Verilog . . . . . Hardware synthesis . . . . . . . . . . . . . . . . . Examples for the layer pattern . . . . . . . . . . Software architecture . . . . . . . . . . . . . . . . Separating hardware in control- and data path . Clock to output delay of ip-ops . . . . . . . . . Meta-stable states of ip-ops . . . . . . . . . . . Signal synchronization with double synchronizer Architecture of Xilinx Spartan II . . . . . . . . . Xilinx Spartan II CLB implementation . . . . . . Xilinx Spartan II slice implementation . . . . . . Architecture of the RSA chip . . . . . . . . . Architecture of the control unit . . . . . . . . State diagram of the control unit . . . . . . . IRQ handshake circuit . . . . . . . . . . . . . Architecture of the Montgomery multiplier . Timing diagram of the PCI write transaction Timing diagram of the PCI read transaction . Schematic of the interface . . . . . . . . . . . Architecture of the high-radix multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 18 18 21 22 23 24 26 29 31 33 34 35 36 36 37 39 40 41 48 53 54 54 56 59 60 61 62

List of Tables
2.1 2.2 2.3 2.4 4.1 4.2 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 5.10 5.11 5.12 Radix-2 Booths recoding Radix-4 Booths recoding Radix-8 Booths recoding Basic Boolean operations scheme scheme scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 20 21 23

State encoding formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 Operation mode of the Montgomery multiplier . . . . . . . . . . . . . . . . 55 XILINX Xc2s200 synthesis results . . . . . . . XILINX Xc2v3000 synthesis results . . . . . . . XILINX Xc2v4000 synthesis results . . . . . . . XILINX Xc2v6000 synthesis results . . . . . . . XILINX Xc2v8000 synthesis results . . . . . . . XILINX Xc2s200 throughput results . . . . . . XILINX Xc2v3000 throughput results . . . . . XILINX Xc2v4000 throughput results . . . . . XILINX Xc2v6000 throughput results . . . . . XILINX Xc2v8000 throughput results . . . . . Throughput results with average exponent . . . Throughput results with F4 exponent (216 + 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 67 67 67 67 68 69 69 69 70 70 70

ii

List of Algorithms
1 2 3 4 5 6 7 8 9 10 11 Square and multiply, right to left . . . . . . . Square and multiply, left to right . . . . . . . Square and multiply, optimized version . . . . Montgomery multiplication algorithm, version Montgomery multiplication algorithm, version Montgomery multiplication algorithm, version Montgomery multiplication algorithm, version Binary right-shift multiplication algorithm . . Binary left-shift multiplication algorithm . . . Radix-r right-shift multiplication algorithm . Radix-r left-shift multiplication algorithm . . . . . 1 2 3 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 9 10 12 13 14 15 16 17 17 17

iii

List of Abbreviations
AMD API CLB CMOS CPU CSA DLL DLLs DPRAM FA FIFO FPGA GCHQ HDL ISR IOB JCA JCE JSP LSB LUT MFC MIT MSB PCI PKCS RAM RSA SPRAM SSL VHDL VLSI Advanced Micro Devices Application Programming Interface Congurable Logic Blocks Complementary Metal Oxide Semiconductor Central Processing Unit Carry Save Adder Direct Link Library Delayed Locked Loops Dual-Port Random-Access Memory Full Adder First In First Out Field Programmable Gate Array Government Communications Headquarters (British intelligence service) Hardware Description Language Interrupt-Service Routine Input/Output Blocks Java Cryptography Architecture Java Cryptography Extension Java Server Pages Least Signicant Bit Lookup Tables Microsoft Foundation Classes Massachusetts Institute of Technology Most Signicant Bit Peripheral Component Interconnect Public-Key Cryptography Standards Random Access Memory Asymmetric Cryptographic System by Ronald L. Rivest, Adi Shamir, and Leonard Adleman Single-Port Random-Access Memory Secure Socket Layer Very High-Speed Hardware Description Language Very Large Scale Integration

iv

Chapter 1 Introduction
1.1 Motivation

Public-key cryptography is used in our daily life. It becomes particularly important for setting up safe connections over insecure networks like the Internet. It allows to exchange authenticated data and to generate digital signatures. The applied mathematical operations are very time-consuming. If the RSA algorithm is used as cryptographic backend, the basic mathematical operation is modular exponentiation. To achieve appropriate security, the used word size should not be less than 1024 bits. Modular exponentiation is achieved by repeated modular multiplication. Implementing the RSA algorithm on hardware is typically done by implementing a fast modular multiplier. Designing such a multiplier is always a tradeo between achieved speed and needed area. For example, architectures featuring multipliers with small or moderate word sizes do not need much area, but their computation time is longer because they need more clock cycles to obtain the result. On the other hand, parallel multipliers (such as array multipliers) can obtain the result in one clock cycle, but they are very big in terms of required area. Therefore, it is a good idea to use parameterizable designs. To build parameterizable hardware, the use of a hardware description language (HDL) is indispensable. As well, the used algorithm must oer the possibility to change the word size and the radix. There are many server-side applications which need high performance encryption and decryption devices. For example a web server providing asymmetric authentication to several thousand users. It therefore is highly recommendable to use special hardware to accelerate the cryptographic computation substantially. Todays FPGAs are already manufactured on 90 nm CMOS processes and oer a large number of ip-ops and logical functions. Therefore, it is possible to implement the RSA algorithm with actual word sizes from 1024 bits up to 2048 bits to achieve throughput rates required for server operation. There exist many software frontends which provide already optimized encryption and decryption of data and generation and verication of digital signatures. Wellknown Java frontends are JCE and JCA [LFB+ 00]. In this work, these interfaces 1

were implemented to use the FPGA device as encryption and decryption chip. This oers the advantage that already existing programs do not need to be modied, nevertheless they can prot from the improved performance.

1.2

Objectives

The goal of this work is a high-speed hardware implementation of the RSA algorithm on an FPGA platform, oering an easy way to change the used word size and radix of the implementation. This leads to parameterizable hardware. Therefore, VHDL is used for implementing such a parameterizable design. Thus it is possible to pre-compute as many FPGA congurations as needed, which is a precondition to adapt the RSA implementation for particular needs. This makes an FPGA nearly as exible as software. To exchange data with the RSA device, there is also an interface (kernel driver) to the operating system required. For this application, the use of the layer pattern is obviously recommendable. This oers the advantage that each layer is implemented independently from the others. The whole design ow, from the JCE interface to the hardware description, is shown in this thesis. In order to prove the functionality of the implementation, a proof of concept implementation is shown on a Xilinx Spartan II (Xc2s200) device. The objective for this relatively small FPGA is to reach the maximally possible clock frequency of 66 MHz with greatest possible word size and multiplier radix. In order to be able to compare the performance with other implementations, the circuit is also to be synthesized on larger Xilinx devices like the Virtex II line. With such high-end FPGAs it should be possible to provide up to thousand 1024-bit RSA signatures per second.

1.3

Structure of this Thesis

The thesis is written in a top down manner, which means that general problems will be described rst and details are discussed later on. The next chapter deals with the theoretical background of public-key cryptography. It covers the RSA algorithm, digital signatures, Montgomery multiplication, high-radix multiplication and much more. It also explains why the chosen algorithm can be implemented very well in hardware. The third chapter describes the used design methodology in more detail. Keywords such as synthesis, simulation and hardware description languages are described in detail. Both the software design process is treated as well as the hardware design process. In the hardware section, the implementation of complex circuits on FPGA basis is discussed in particular. The focus lies on high-speed optimization in order to achieve high clock rates and thus a high throughput with these devices. The fourth chapter describes the implementation of the RSA hardware accelerator itself. Again, it separately treats software and hardware. Important basic 2

elements of the implementation are explained and reasons for design decisions are described explicitly. The fth chapter shows the obtained results and compares them with other implementations. It also tries to point out the pros and cons of the selected implementation. The last chapter concludes the work and gives an outlook on possible future improvements.

Chapter 2 Problem Analysis and Theoretical Background


This chapter is organized in a top-down manner. It starts with public-key cryptography. Then RSA, modular exponentiation, modular multiplication, and fast multiplication are treated. Finally, the usage of redundant number systems with Booths recoding and carry save addition are discussed.

2.1

Public-Key Cryptography

Before going into the details of public-key cryptography, the dierence between asymmetric and symmetric cryptosystems should be explained. If Alice wants to send messages to Bob, which are encrypted with a cryptosystem, Alice needs a key e for encryption and Bob the associated key d for decryption. If in the cryptosystem the encryption key e is equal to the appropriate decryption key d or d can be easily computed from e, we speak of a symmetric cryptosystem or a secret-key system. If such a system is employed, Alice and Bob must exchange the key e at the beginning of the communication over a safe channel and then keep it secret. The decryption key d can be computed from e. In asymmetric cryptosystems, d and e are dierent and d can not be computed in reasonable time from e. If Bob wants to receive encrypted messages, he publishes the encryption key e and keeps the decryption key d secret. Everyone can use e now, in order to send an encrypted message to Bob. Therefore, e is also called the public key and d is called the private key. A cryptosystem where the private key cannot be derived from the public available data (the public key) is also known as a public-key system. Such cryptosystems are also called asymmetric cryptosystems. Public-key systems often use two dierent key spaces, since encoding and decoding keys have dierent representations. RSA (see section 2.2) uses a public key consisting of a pair (n, e) of natural numbers, while the private key is one natural number d. 4

Cliord Cocks has developed the rst asymmetric algorithm at the GHCQ in the year 1970. The algorithm was reinvented by Rivest, Shamir and Adleman in the year 1976 at the MIT [RSA79].

2.1.1

Advantages of Public-Key Cryptography

The disadvantage when using symmetric coding algorithms is the distribution and administration of secret keys. Before each secret communication, Alice and Bob must exchange a secret key. Therefore, a safe channel must be available. This problem becomes increasingly dicult if more participants in a network want to communicate with one another. When using public-key algorithms, the key management becomes substantially simpler, because there is no need to exchange secret keys beforehand. A problem of public-key cryptography is the mapping of public keys to a particular person. If someone wants to send an encrypted message to Alice, he has to ensure that the public key he uses really was published by Alice. If an attacker succeeds in replacing the public key of Alice by his own, then the attacker can decrypt all messages which were intended for Alice. In practice, asymmetric and symmetric cryptography were often mixed because asymmetric cryptography is too slow for bulk encryption. Therefore, asymmetric cryptography is used for generation and verication of digital signatures and key establishment. Symmetric cryptography is used for bulk encryption. Examples for such hybrid systems are SSL, OpenSSL, and TLS.

2.1.2

Security of the Private Key

A public-key system can be only safe if it is impossible to compute the private keys from the public keys in reasonable time. Today this is ensured by the complexity of computation in order to resolve the corresponding problems of number theory. There is however no certain knowledge about the persistence of this situation in the future. It is well known for example, that quantum computers can make all common public-key systems unsafe. However, this does not imply that such computers can really be built. Therefore, it is absolutely necessary to design security infrastructures in a way that enables the easy replacement of the used cryptographic techniques. There is a huge amount of literature about public-key cryptography and its security. A selection can be found under [Buc03, DH76, Sch96, Sti95].

2.2

RSA

The RSA cryptosystem, designated after its inventors Ron Rivest, Adi Shamir, and Len Adleman. It was the rst publication of a public-key system and is still one of the most important ones. Its security is closely associated with the diculty 5

of factorizing large numbers. The following section describes how to use the RSA cryptosystem. Especially key generation, the encryption, and the decryption are treated in detail.

2.2.1

Generating Keys

Alice selects two random prime numbers p and q and computes the product n = p q. Additionally Alice selects a natural number e with 1 < e < (n) = (p 1) (q 1) and gcd(e, (p 1) (q 1)) = 1 and computes a natural number d with 1 < d < (p 1) (q 1) and d e 1 mod (p 1) (q 1). (2.3) (2.1)

(2.2)

Since gcd(e, (p 1) (q 1)) = 1, there is actually such a number d. It can be computed with the extended Euclidean algorithm. We also consider that e is always odd. The public key consists of the pair (n, e). The private key is d. The number n is called RSA modulus, e is called encryption exponent and d is called decryption exponent. Nowadays, the word size of the used modulus is 1024 bits up to 2048 bits. So the chosen primes p and q are between 512 bits and 1024 bits. For signatures 216 + 1 is often used as encryption exponent e (see section 2.3.2).

2.2.2

Encryption

The nite set of plaintexts consists of all natural numbers m with 2 m < n. (2.4)

Zero and one should not be used because in that case the resulting ciphertext would equal the plaintext. A plaintext m is encrypted with c = me mod n. Where c is the ciphertext. (2.5)

2.2.3

Decryption

The decryption of RSA is based on the following theorem. If (n, e) is a public key and d is the according private key in the RSA system, then we have (me )d mod n = m 6 (2.6)

for each natural number m with 0 m < n Proof: Since e d 1 mod (p 1) (q 1), there is an integer l, so that e d = 1 + l (p 1) (q 1). Therefore (me )d = med = m1+l(p1)(q1) = m (m(p1)(q1) )l . This equation shows that (me )d m (m(p1) )(q1)l m mod p (2.9) (2.8) (2.7)

applies. If p is not a divisor of m, this congruence results from the small theorem of Fermat. Otherwise the statement is trivial, since both sides of the congruence are 0 mod p. Exactly the same applies to (me )d m mod q. Because p and q are dierent prime numbers, we achieve (me )d m mod n. If c was computed as in equation 2.5, we can reconstruct m by means of m = cd mod n. (2.12) (2.11) (2.10)

2.3

Security of RSA

To nd out the secret RSA key is as dicult as factorizing the RSA modulus. The proof can be found at [Buc03, Sch96, Mur94, Sti95]. However, the attacker can also have the intention to nd the plaintext from the ciphertext. It is not known whether it is therefore necessary to nd the secret RSA key. But even if it could be proved that breaking RSA is as dicult as factorizing the RSA modulus, this would not automatically signify that RSA is totally secure, since there is no knowledge about the diculty of factorizing natural numbers.

2.3.1

The Selection of p and q

To make the factorization of the RSA modulus n as dicult as possible, both prime factors p and q should be chosen the same size. Sometimes it is still required that p and q are chosen with respect to known factorization algorithms and their computation time, so that a result can not be obtained in reasonable time. Hence, p and q should be chosen randomly and uniformly distributed. For long term security, the length of the RSA modulus should be at least 1024 bits. More about key sizes can be found at [LV99]. 7

2.3.2

The Selection of e

The exponent e should be chosen in such a manner that the encryption is ecient, without decreasing the security. The choice e = 2 is always excluded because (n) = (p1)(q 1) is even and gcd(e, (p1)(q 1)) = 1 must be matched. Therefore, the smallest exponent would be three. As shown in [Buc03], with e = 3 a low exponent attack is going to be successful. A common choice of the encryption exponent is e = 216 + 1. This exponent withstands known low-exponent attacks, whilst speeding up encryption signicantly.

2.3.3

The Selection of d

The decryption exponent d has to be greater than n0.292 , otherwise the RSA cryptosystem can be broken. The according proof can be found at [BD00].

2.4

Signatures with RSA

RSA can also be used to generate digital signatures. The idea is quite simple. Alice signs the document m by applying her decryption function to the document, she computes s = md mod n, in which n is the RSA modulus and d the decryption exponent. Bob veries the signature by applying her encryption function to the signature, he computes se mod n, where e is the public key of Alice. If the result is the document m, the signature is valid. The key generation works the same way as with the RSA cryptosystem. There are also some known vulnerabilities when signatures are produced like this. An attacker can apply the no message attack or can use the multiplicative property of RSA. More about attacks on RSA signatures can be found at [Buc03]. Therefore, also a hash function is applied to the document or the document is lled up with some random numbers. There are many possibilities to improve the security of signatures. Therefore cryptographic standards like PKCS #1 were introduced for signatures and other cryptographic applications. They can be found at [rsa04].

2.5

Fast Modular Exponentiation

For fast RSA encryption and decryption, ecient modular exponentiation is crucial. This can be achieved by the Square and Multiply algorithm. The idea behind square and multiply is based on the binary representation of the exponent. The binary representation can be obtained with
l1

e=
i=0

ei 2i 8

(2.13)

where l is the word size of e. The modular exponentiation can be expressed as


l1

p = a mod m a

Pl1

i=0 ei 2

i=0

(a2 )ei .

(2.14)

So we need to square in each iteration and to multiply only if the ith bit of the exponent is one. This leads to algorithm 1 and algorithm 2. Algorithm 1 Square and multiply, right to left Input: a, e, m Output: p = ae mod m 1: p = 1 2: y = a 3: for i = 0 to l 1 do 4: p = p y mod m 5: if ei == 1 then 6: y = y y mod m 7: end if 8: end for 9: return p

Algorithm 2 Square and multiply, left to right Input: a, e, m Output: p = ae mod m 1: p = 1 2: for i = l 1 downto 0 do 3: p = p p mod m 4: if ei == 1 then 5: p = p a mod m 6: end if 7: end for 8: return p The advantage of algorithm 2 compared to algorithm 1 is that no additional variables are required for intermediate results. This is very important for the implementation in hardware, because no additional registers are needed. The algorithm can be optimized even further, leading to algorithm 3. With algorithm 3, the number of multiplications can be reduced if the position of the MSB is lower than the word size of the exponent. The hardware implementation of this algorithm is very simple, the exponent only needs to be shifted until the rst 1 occurs, thereafter the square-and-multiply procedure can start. The detailed implementation is shown in chapter 4.2.2. 9

Algorithm 3 Square and multiply, optimized version Input: a, e, m Output: p = ae mod m Require: e > 1 1: p = a 2: i = l 1 3: while ei ! = 1 do 4: i=i1 5: end while 6: for j = i 1 downto 0 do 7: p = p p mod m 8: if ej == 1 then 9: p = p a mod m 10: end if 11: end for 12: return p Since we need up to two modular multiplications in each iteration, these operations should perform as fast as possible. A very fast modular multiplication, without trail division, is achieved by the Montgomery multiplication algorithm, which is discussed in the next section.

2.6

Modular Multiplication

A trivial version of modular multiplication of p = a b mod m would compute a b rst and then apply a modular reduction step. Therefore, the following mathematic operations need to be calculated. p = a b f loor This approach has two essential drawbacks. The word size is doubled for the intermediate result if the word size of a and b is equal, because the word size is calculated by log2 a + log2 b. When using up to 2048 bits for each operand, with this increase of complexity it is nearly impossible to build fast and small hardware multipliers. After each multiplication a modular reduction step is needed. This is usually done by an integer division, which is a complex operation. These concerns can be handled by the Montgomery multiplication with some additional costs for pre- and post-processing. 10 ab m (2.15)

2.6.1

Montgomery Multiplication

The Montgomery algorithm [Mon85] calculates M onM ul(a, b) = a b r1 mod m from the two numbers a and b. In comparison to conventional modular multiplication, a further factor r1 is needed. How to compute r is shown later in this section. To compute a b mod m, some transformations are needed, the reason is shown in equation 2.18. However, these transformations can also be done with the Montgomery algorithm. Transformation to the Montgomery Domain To transform an integer a to the Montgomery domain, it needs to be multiplied by the constant r2 . The result of this multiplication is the transformed number a a = M onM ul(a, r2 ) = a r2 r1 mod m = a r mod m Transformation to the Z Domain Also the inverse transformation can be computed by a Montgomery multiplication. In this case, an element of the Montgomery domain needs to be multiplied by the constant 1. The result of this multiplication is the transformed number a. (2.16)

a = M onM ul(, 1) = a 1 r1 mod m = a 1 = a r 1 r mod m = a mod m Modular Multiplication using Montgomery Multiplication

(2.17)

With help of these two transformations, the modulo multiplication of two integers a and b can be performed. First the two numbers will be transformed into the Montgomery domain, then the multiplication is applied, afterwards the result is transformed back into the Z domain.

a = M onM ul(a, r2 ) = a r2 r1 mod m = a r mod m = M onM ul(b, r2 ) = b r2 r1 mod m = b r mod m b c = M onM ul(, = a r1 mod m = a b) b = a r b r r1 mod m = a b r mod m c = M onM ul(, 1) = c 1 r1 mod m = c 1 = a b r 1 r mod m = a b mod m 11

(2.18)

Montgomery Multiplication Algorithm Algorithm 4 shows the original Montgomery algorithm, where m is the modulus of the modular multiplication, a and b are the operands, p is the resulting product, k is the number of bits which are processed in one computation cycle (so this is a 2k -radix version), n is the number of computation cycles to complete the modular multiplication, and r, m are constants needed during the computation. There are two dierent denitions of the so called radix found in literature. Some dene the radix as 2k and some as k, where k is the word size. For example: if k = 4, in the rst case the radix is 16, and in the second case it is 4. In this thesis the radix is always dened as 2k . Algorithm 4 Montgomery multiplication algorithm, version 1 Input: a, b, k, m Output: p = a b r1 mod m and 0 p 2 m Require: m > 2 with gcd(m, 2) = 1 Require: k, n such that 4 m < 2kn Require: r, m with 2kn r1 mod m = 1 and m m mod 2k = 1 Require: a with 0 a 2 m l1 Require: b with i=0 (2k )i bi and 0 b 2 m 1: p = 0 2: for i = 0 to n 1 do 3: q = (((p + bi a) mod 2k ) m ) mod 2k 4: p = p+qm+bi a 2k 5: end for 6: return p For calculating all required constants, the word size of the modulus m, the word size of the radix k and the modulus m are needed as input. 4m log2 4 m log2 4 + log2 m 2 + wordsize(m) 2kn kn kn kn 2 + wordsize(m) n > k 2 + wordsize(m) n = k < < < <

(2.19)

r is calculated from 2kn r1 mod m = 1 with r = 2kn . m is the negative modulus inverse of m to the base 2k and can easily be computed with the extended Euclidean algorithm. The correctness of these algorithms is shown in [Mon85] and 12

[Oru95]. The advantage of the Montgomery algorithm is the easy computation of q and p. They need to be calculated in each iteration. For q, two multiplications, one addition and two modular reductions are needed. For p, two k-bit multiplications, two additions and one division are needed. The division can be performed as right shift operation, because the divisor is a power of two. The modulus reduction is very simple too, because the modulus is a multiple of two. To perform a modulus reduction of 2k , only the least signicant k bits from the operand need to be taken. To reduce the amount of multiplications and modular reductions, some optimizations are required, which were introduced by Holger Orup [Oru95].

2.6.2

Orups Optimization

When using a higher radix in algorithm 4, the result can be computed in fewer cycles. But the computation time for one iteration increases, because the quotient determination requires an addition of l + k bits and two multiplications of k (l + k) bits. Therefore, it is possible that the total computation time will increase. The major advantage of Orups optimization is the possibility to pipeline the computation and thus to reduce the logical depth and increase the clock rates. The idea is to reduce the expensive algorithmic operations in each iteration by using additional precomputed constants during the calculation. Algorithm 5 avoids the multiplication in quotient determination. The condition m mod 2k = 1 must be fullled for all values of the modulus. This is done by transforming the constant m to m. The k new constant m can be calculated by m = (m mod 2 ) m. Algorithm 5 Montgomery multiplication algorithm, version 2 Input: a, b, k, m Output: p = a b r1 mod m and 0 p 2 m Require: m > 2 with gcd(m, 2) = 1 Require: k, n such that 4 m < 2kn where m = (m mod 2k ) m kn 1 Require: r, m with 2 r mod m = 1 and m m mod 2k = 1 Require: a with 0 a 2 m l1 k i Require: b with i=0 (2 ) bi and 0 b 2 m 1: p = 0 2: for i = 0 to n 1 do 3: q = (p + bi a) mod 2k 4: p = p+qm+bi a 2k 5: end for 6: return p In comparison to algorithm 4, the borders of n changed from 4 m < 2kn to 4 m < 2kn . So it is necessary to recalculate n. m is given by m = (m mod 2k ) m 13 (2.20)

so the word size of m is calculated by wordsize(m) = wordsize(m mod 2k ) + wordsize(m) = k + wordsize(m). (2.21) Therefore n is found by 4m log2 4 m log2 4 + log2 m 2 + wordsize(m) 2 + wordsize(m) + k 2kn kn kn kn kn 2 + wordsize(m) + k n > k 2 + wordsize(m) n > +1 k 2 + wordsize(m) n = +1 k < < < < <

(2.22)

So an additional cycle for the computation is needed, but one multiplication is saved in each iteration for computing the quotient q. The next version (algorithm 6) shows how to avoid the addition in the quotient determination. This is done by replacing a with 2k a. To compensate the extra factor 2k , an additional iteration is needed for multiplication. Algorithm 6 Montgomery multiplication algorithm, version 3 Input: a, b, k, m Output: p = a b r1 mod m and 0 p 2 m Require: m > 2 with gcd(m, 2) = 1 Require: k, n such that 4 m < 2kn where m = (m mod 2k ) m kn 1 Require: r, m with 2 r mod m = 1 and m m mod 2k = 1 Require: a with 0 a 2 m l1 Require: b with i=0 (2k )i bi and 0 b 2 m 1: p = 0 2: for i = 0 to n do 3: q = p mod 2k 4: p = p+qm + bi a 2k 5: end for 6: return p In this version q is computed from the remainder of the intermediate sum modulus the base 2k . This equation can easily be implemented in hardware, because 14

only the lowest k bits need to be taken from the intermediate sum. The addition of p + q m is a very expensive arithmetical operation, because of the possible word size of the two operands. So very fast adders would be necessary to achieve a high-speed multiplication with this algorithm. Therefore, it is a good idea to recompose the statement for p to avoid such an expensive addition. The following equation shows the rewritten statement. p+qm + bi a = k 2 q m + p mod 2k p + + bi a k 2 2k q (m + 1) p + bi a = k+ 2 2k p m+1 = k +q + bi a 2 2k This transformation leads to the nal algorithm 7. = Algorithm 7 Montgomery multiplication algorithm, version 4 Input: a, b, k, m Output: p = a b r1 mod m and 0 p 2 m Require: m > 2 with gcd(m, 2) = 1 Require: k, n such that 4 m < 2kn where m = (m mod 2k ) m kn 1 Require: r, m with 2 r mod m = 1 and m m mod 2k = 1 Require: a with 0 a 2 m l1 Require: b with i=0 (2k )i bi and 0 b 2 m = m+1 1: m 2k 2: p = 0 3: for i = 0 to n do 4: q = p mod 2k 5: p = 2pk + q m + bi a 6: end for 7: return p The advantage of algorithm 7 compared to algorithm 6 is the dataow independence in each iteration. Because of this, each term can be computed in one iteration without aecting the other terms. The constant m must be calculated only once for each modulus m. As mentioned before, the computation of q can easily be done by using only the lowest k bits of the previous intermediate sum. The term 2pk also can be easily computed by a right-shift operation by k bits. The bottlenecks for a highspeed design are the two remaining multiplications and the addition of the three terms to obtain the intermediate sum. Therefore, the next two sections treat the theoretical background of high-speed multiplication and high-radix multiplication. The design and implementation of these essential parts are discussed in chapter 3. 15

(2.23)

2.7

Multiplication
al1 al2 . . . a1 a0 bl1 bl2 . . . b1 b0 p2l1 p2l2 . . . p1 p0

The following notation is used during the next consideration. a Multiplicand b Multiplier

p Product (a b)

The multiplication of two binary numbers is realized by shifting and adding. Figure 2.1 shows a multiplication of a six bit with a four bit binary number in dot notation (see [Par00]). The partial products bi a can be easily formed by a logical AND operation. This scheme also applies to other non-binary multiplications, but computing the terms bi a becomes more dicult and the product will be one digit wider than a. The remaining multi-operand addition is the same.

Figure 2.1: Multiplication of two binary numbers in dot notation Instead of shifting the terms bi a, it is easier to shift the cumulative partial product. Therefore, there exist two version for one-bit-a-time multiplication, depending on the order of adding the terms bi a. Processing Figure 2.1 from top to bottom leads to the right-shift algorithm 8, from bottom to top leads to the left-shift algorithm 9. In the shown algorithms l is the word size of the multiplier b. Algorithm 8 Binary right-shift multiplication algorithm Input: a, b Output: p = a b 1: p = 0 2: for i = 0 to l 1 do 3: p = (p + bi a 2l ) 21 4: end for 5: return p

16

Algorithm 9 Binary left-shift multiplication algorithm Input: a, b Output: p = a b 1: p = 0 2: for i = 0 to l 1 do 3: p = 2 p + bli1 a 4: end for 5: return p

2.8

High-Radix Multiplication

As mentioned before, there is no consistent denition of the multiplication radix in literature. In this thesis the radix is dened as 2k , where k is the number of bits which are processed in one iteration during the multiplication. In other words, the base of the number is changed to the radix 2k . The modied right-shift version and left-shift version of the radix-r multiplication is shown in algorithm 10 and algorithm 11. Algorithm 10 Radix-r right-shift multiplication algorithm Input: a, b Output: p = a b 1: p = 0 2: for i = 0 to l 1 do 3: p = (p + bi a rl ) r1 4: end for 5: return p

Algorithm 11 Radix-r left-shift multiplication algorithm Input: a, b Output: p = a b 1: p[0] = 0 2: for i = 0 to l 1 do 3: p = r p + bli1 a 4: end for 5: return p Figure 2.2 (see [Par00]) shows an example of radix-4 multiplication in dot notation. So for a radix-4 multiplication two bits a time are needed to form the partial product. In each step the partial product (bi+1 bi ) a needs to be formed and added to the intermediate partial sum. When using radix-2, processing one bit in each cycle, 17

Figure 2.2: Radix-4 multiplication in dot notation

the partial product can easily be formed because only 0 a and 1 a need to be formed. This can be done by masking. With higher radices this computation is not that simple anymore. For example, with radix-4 the multiples 0 a, 1 a, 2 a and 3 a are needed. 0 a and 1 a can be computed as before, 2 a requires a shift operation. In this case the problem is how to compute 3 a. A solution would be the pre-computation of 3 a and to store it for future use in a register. Figure 2.3 (see [Par00]) shows a hardware example for the multiple selection of a radix-4 multiplier.

Figure 2.3: Operand selection for radix-4 multiplication

The advantage is that this can easily be done even with higher radices. The disadvantage is the pre-computation of all operands which are not a multiple of 2i . With radix-4 there is only one pre-computation needed, for radix-8 there are three pre-computations needed (3 a, 5 a and 7 a). In general, each prime number in the used radix set has to be pre-computed. For very high radices, this slows down the whole computation, since the pre-computation must be performed before each multiplication. Another approach for high-radix multiplication is the use of a redundant number representation. 18

2.8.1

Redundant Number Representation

For a non-redundant number system there exists only one representation of any number, for example the binary, octal, decimal, hexadecimal number systems. When using redundant number systems, there is more than one representation for a certain numberdepending on the radix. For example, a radix-2 digit set with [0,1,2] can be used to represent the number 6 as (110) or as (102). When using a radix r digit set [, ], the following cases can be distinguished Signed > 0 Unsigned = 0 Non-redundant = r 1 Redundant > r 1 Redundant number systems can be used to recode a binary number so that only simple operations like masking and shifting are required to compute the necessary multiples of the multiplicand. A very popular recoding scheme is modied Booths recoding.

2.8.2

Modied Booths Recoding

The idea behind Booths recoding is the detection of strings of 1s. The recoding is performed by the scheme shown in table 2.1. The binary number is represented by xi with i = 0 to i = l 1 and l as the word size. Per denition x1 = 0. The recoded number is represented by yi with i = 0 to i = l. xi 0 0 1 1 xi1 0 1 0 1 yi 0 1 -1 0 Explanation No string of 1s End of a string of 1s Beginning of a string of 1s Continuation of string of 1s

Table 2.1: Radix-2 Booths recoding scheme If a string of 1s starts, the multiplicand is subtracted from the cumulative partial product. If a string of 1s ends, the multiplicand is added to the cumulative partial product. If there is a continuing string of 1s or 0s, nothing is added to the cumulative partial product. In other words, a continuing string of 1s will be represented as 2j + 2ji + + 2i+1 + 2i = 2j+1 2i . (2.24)

The longer the sequence of 1s, the larger savings can be achieved, for example: 19

1 0 0 1 1 1 1 1 1 1 1 0 1 1 0 -1 0 1 0 0 0 0 0 0 0 -1 1 0 -1 1

1 binary number -1 recoded number

But if there are only isolated 1s in the binary number, no savings can be achieved, for example: 1 0 1 0 -1 1 -1 1 1 0 1 0 1 0 -1 1 -1 1 -1 1 1 0 1 0 -1 1 -1 1 1 binary number -1 recoded number

With radix-2 this does not make sense in designs where the data path goes through the adder anyway. But it can be used in micro programs, because shifting alone is faster than addition followed by shifting. With radix-4 it also makes sense for designs where the data path goes through the adder. The disadvantage of radix-4 multiplication was the need of an extra addition to form the 3 a multiple. With Booths recoding it is possible to convert the radix-4 digit set [0,3] to [-2,2]. Accordingly, there is no need to compute the partial product 3 a anymore. The recoding scheme is shown in table 2.2. The binary number x is converted to z, y is the radix-2 conversion of x. xi+1 0 0 0 0 1 1 1 1 xi 0 0 1 1 0 0 1 1 xi1 0 1 0 1 0 1 0 1 yi+1 0 0 1 1 -1 -1 0 0 yi 0 1 -1 0 0 1 -1 0 zi/2 0 1 1 2 -2 -1 -1 0 Explanation No string of 1s End of a string of 1s Isolated 1 End of a string of 1s Beginning of a string of 1s End one string, begin new string Beginning of a string of 1s Continuation of string of 1s

Table 2.2: Radix-4 Booths recoding scheme Thus, if radix-4 recoding is performed, only the multiples a and 2 a of the multiplicand are required. All of them can easily be obtained by shifting and other low level logical functions like AND and XOR. Figure 2.4 (see [Par00]) shows an example hardware design of a radix-4 Booth-recoded multiplier. The disadvantage of Booths recoding is the need for signed numbers. So an additional sign bit and hardware are required to form the twos complement of a binary number. When using radix-8 Booths recoding scheme (see table 2.3), the multiple 3a appears again. This fact does not change in radices higher than eight. Therefore, pre-computation is needed with radices greater than four. To avoid this pre-computation, carry-save adders can be used. 20

Figure 2.4: Radix-4 partial product generation with Booth recoding xi+2 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 xi+1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 xi 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 xi1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 yi+2 0 0 0 0 1 1 1 1 -1 -1 -1 -1 0 0 0 0 yi+1 0 0 1 1 -1 -1 0 0 0 0 1 1 -1 -1 0 0 yi 0 1 -1 0 0 1 -1 0 0 1 -1 0 0 1 -1 0 zi/3 0 1 1 2 2 3 3 4 -4 -3 -3 -2 -2 -1 -1 0 Explanation No string of 1s End of a string of 1s Isolated 1 End of a string of 1s Isolated 1 Isolated 1 Isolated 1s End of a string of 1s Beginning of a string of 1s End one string, begin new string End one string, begin new string End one string, begin new string Beginning of a string of 1s End one string, begin new string Beginning of a string of 1s Continuation of string of 1s

Table 2.3: Radix-8 Booths recoding scheme

2.8.3

Carry-Save Adders

The basic operation of the multiplication is the addition. Therefore, fast adders are needed to build fast multipliers. That is particularly important when operating with word sizes beyond 1024 bits (for example RSA). There are many dierent fast adder designs. The most important ones are carry-propagation adder, carry-look-ahead adder, carry-skip adder, and carry-save adder. An excellent description of their detailed functionality can be found in [Par00]. However, the architecture of fast 21

adders always results in higher hardware complexity. In this section the carry-save adder is explained in detail, because it is often used within the RSA design. Carry-save adders (CSA) can be obtained from a full-adder (FA) cell. Figure 2.5 shows the conversion from a ripple carry adder to a carry save adder.

Figure 2.5: Conversion of a ripple-carry adder into a carry-save adder The disadvantage of a ripple-carry adder is the long carry chain when operating on big word sizes. The last full adder has always to wait for the carry from the previous full-adder cells. In other words, ripple-carry adders have a long critical path. This decreases the computation time signicantly. Analyzing the timing behavior can be performed with the O-notation. The O-notation shows the needed computation time for innite input. Therefore, a ripple-carry adder has a computation time of O(n), because n full-adder cells form the critical path. With O-notation it is also possible to compare dierent adder architectures. But it is necessary to keep in mind that in real world no innite numbers are used as input. So CSAs are saving the carries instead of propagating them. Figure 2.5 shows also that a CSA is made up of ordinary full adders. The Boolean equations for a full adder cell are:

sum = a b c carry = (a b) + (a + b) c = (a b) + (a c) + (b c)

(2.25)

Where is the Boolean XOR function, is the Boolean AND function and + the Boolean OR function. For the sake of completeness, the truth table of this operators is shown in table 2.4. In other words, a CSA is a 3-to-2 reduction circuit. Therefore, only multioperand addition makes sense with carry-save adders. CSAs can be combined to a tree. A CSA tree reduces n binary numbers to two binary numbers, which are 22

AND a b 0 0 0 1 1 0 1 1

() ab 0 0 0 1

a 0 0 1 1

OR (+) b a+b 0 0 1 1 0 1 1 1

XOR () a b ab 0 0 0 0 1 1 1 0 1 1 1 0

Table 2.4: Basic Boolean operations representing the sum and the carry. There are several ways to convert the redundant representation of the result into the binary representation. They are explained later in this section. Figure 2.6 shows a 6-to-2 compression tree, also called Wallace tree.

Figure 2.6: 6-to-2 CSA reduction tree Using CSA trees within multiplication architectures reduces the critical path signicantly. The height of the Wallace tree can be estimated with h(n) log1.5 n . 2 The logical depth is equivalent to the height of the tree. Compared to a bit-serial multiplier (single bit a time multiplier), designs as shown in gure 2.7 reduce the number of computation cycles for one multiplication. The disadvantages are the need for additional area (compared to a two-operand addition) and the nal addition which has to be done by a ripple-carry adder, carrylook-ahead adder, carry-skip-adder, or any other carry-propagate adder. Especially when using large word sizes like 1024 bits or more, the nal addition is a very time consuming operation. The multiplier architecture has to pay special attention to the nal adder, because the nal adder could contribute signicantly to the critical path of the circuit. The nal addition can also be performed by the CSA tree itself. This is realized by using the result of the adder tree as input and setting all other operands to zero. 23

So it is necessary to know how many times this operation has to be performed. The worst case of this transformation is depending on the used word size. In each computation cycle the carry is shifted one bit to the left. Therefore the transformation is nished after word size cycles. 1 Parhami shows in [Par00] that the probability of carry generation is 4 , the probability of carry annihilation is 1 and the probability of carry propagation is 1 . 4 2 Therefore, the averrage length of a carry chain could be estimated with log2 of the used word size. When using a word size of 512 bits the estimated length is nine, with 1024 it is ten, and with 2048 it is eleven. This is very low compared to the needed computation cycles of a multiplication, and has to be done only once at the end of the multiplication. The disadvantage is that a comparator is required to make sure that the carry is zero. CSA trees can be used to implement fast multipliers with or without Booths recoding. Figure 2.7 shows a radix-4 multiplication without Booths recoding.

Figure 2.7: Radix-4 multiplication with carry-save adder It is also possible to keep the cumulative partial product in stored-carry form (see [Par00] for more details), but an additional register for the carry is needed, which increases the needed area. With such adder trees it is possible to implement highradix multipliers with very short logical depth. This is an important requirement to achieve high clock rates. Therefore, the multipliers (section 4.2.7 used in the RSA implementation were also built with Wallace trees. In conclusion, it is recommended to design high-radix multipliers with CSA trees.

24

Chapter 3 Design Methodology


This chapter discusses the design methodology, from the software frontend to the hardware backend. The used design tools for creating software and hardware are treated, and how they are combined to achieve great design results. This also includes the optimization on dierent abstraction layers like the software layer, the algorithmic layer, the register-transfer level, place and route, and much more.

3.1

Design Flow

For designing complex hardware, which RSA is without doubt, development, simulation and verication tools are necessary. The general design ow of this RSA implementation consists of the following major points: Get information about state-of-the-art implementations. Implement a high-level model with an object-oriented programming language (Java). Compare dierent algorithms, and estimate the costs of their hardware implementation. Choose an appropriate architecture. Implement the hardware with a hardware-description language (VHDL, Verilog, ...). Verify the hardware by simulation and compare the results with the high-level model. Synthesize the hardware. Simulate the hardware at gate level and compare the results with the high-level model. 25

- Generate an FPGA configuration file and compare the results from the FPGA with the high-level model.
- Optimize the implementation for the target platform.

The design flow for FPGAs is shown in figure 3.1. The following sections describe the mentioned points in more detail. The behavior of the used tools, for example synthesis, is also explained.

Figure 3.1: FPGA design flow

3.1.1 State of the Art Implementations

To keep the time to market short, it is necessary to study state-of-the-art implementations. This gives an overview of successfully implemented designs, and with this knowledge, possible failures during the design phase can be minimized.

It is highly recommended to review the design process periodically in order to find possible design failures as soon as possible: the earlier they are discovered, the easier and cheaper they are to correct. There is a huge number of publications about the implementation of high-speed RSA, for example [YCC98, SV93, EW93, Wal91, OK91, Oru95, LW99, LW00, BP99]. Most of them are based on the Montgomery multiplication algorithm. Holger Orup introduced the first hardware design operating on the full word size. Orup's implementation was about five times faster than the fastest implementations of its time. At the time of writing, there were no known implementations of Orup's algorithm on FPGA platforms. It is not possible to achieve better results with an FPGA implementation than with Orup's CMOS implementation, because the routing delay on CMOS will always be less than on FPGA platforms.

3.1.2 High-Level Model

It is very useful to implement a high-level model in an object-oriented language. The implementation was done in Java. An advantage of Java is its support for big numbers, encapsulated in the BigInteger class. With a high-level model it is easier to understand the background of the used algorithms. The first high-level model implements RSA with the Montgomery algorithm. Based on this version, Booth's recoding and Orup's optimizations were added. At this point an incompatibility of Booth's recoding and Orup's optimizations was discovered for radices higher than two. Therefore, the final algorithm implemented on the FPGA is the Montgomery algorithm with Orup's optimization but without Booth's recoding. Another advantage of this decision is that without Booth's recoding there is no need for a signed number representation, which makes the hardware implementation easier. Starting the implementation with ordinary operators like +, -, *, / or mod, they can be replaced successively by more hardware-oriented operations. The result is an algorithm which can be implemented directly in a hardware description language. For example, the implementation of a carry-save addition starts with the + operator and is replaced by

sum = a ⊕ b ⊕ c
carry = (a ∧ b) ∨ (a ∧ c) ∨ (b ∧ c)    (3.1)
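As a concrete instance of this replacement in the Java model, the three-operand addition a + b + c can be swapped for the logic operations of equation (3.1); the shift by one accounts for the weight of the carry word. The values below are arbitrary:

import java.math.BigInteger;

public class CsaStep {
    public static void main(String[] args) {
        BigInteger a = new BigInteger("123456789abcdef0", 16);
        BigInteger b = new BigInteger("0fedcba987654321", 16);
        BigInteger c = new BigInteger("5555aaaa5555aaaa", 16);

        // Hardware-oriented replacement of a + b + c:
        BigInteger sum   = a.xor(b).xor(c);
        BigInteger carry = a.and(b).or(a.and(c)).or(b.and(c)).shiftLeft(1);

        // The redundant (sum, carry) pair encodes the same value:
        System.out.println(sum.add(carry).equals(a.add(b).add(c))); // true
    }
}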

Another advantage of a high-level model is the ability to verify one's own implementation against other, known-correct implementations; especially for RSA there are many. When the implementation of the algorithm is completed and working correctly, the intermediate results of the algorithm should be saved in order

to compare them with the simulation results of the register-transfer-level model. This method offers a fast way to find as many design errors as possible in a very short time span, so the design costs stay low. Another reason is that the implementation of test benches is easier in an object-oriented language, so there are fewer potential sources of faults compared to implementations in hardware. For such testing purposes, an interface to the hardware is needed. How to implement such an interface with Java and C++ is shown in section 3.2.
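One possible format for saving these intermediate results is a file of fixed-width hexadecimal words, one value per line, which a TCL test bench can compare string-wise without big-number arithmetic. The method and file format below are illustrative assumptions, not the thesis's actual test-bench interface:

import java.io.IOException;
import java.io.PrintWriter;
import java.math.BigInteger;
import java.util.List;

public class TraceWriter {

    // Writes each intermediate value as a zero-padded hex word.
    static void dump(List<BigInteger> values, int wordSize, String file)
            throws IOException {
        int hexDigits = wordSize / 4;
        try (PrintWriter out = new PrintWriter(file)) {
            for (BigInteger v : values) {
                out.println(String.format("%0" + hexDigits + "x", v));
            }
        }
    }
}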

3.1.3 Hardware Description Language

Hardware description languages (HDLs) are needed to implement parameterizable hardware. They are used to describe hardware at different abstraction levels. Design descriptions in an HDL are divided into four main sections:

- Definitions: defining variables, signals, ports, and structures of hardware.
- Behavioral description: describing the dependence between inputs and outputs with operators over time.
- Structural description: combining functionality in blocks and connecting their inputs and outputs using wires.
- Constraints: ensuring correct implementation of the synthesized hardware under certain conditions such as area limitations or clock frequency requirements.

HDLs normally follow the concepts of general programming languages, plus additional data types and operators which fulfill the need for concurrent execution. For the control structures, there are three main concepts:

- Local attempt: describing when a local command should be executed.
- Extended structural programming: the common language concepts for programming (if-then-else, for-do, while-do, ...) are extended with additional constructs able to handle concurrency.
- Process communication: the whole system is a set of processes which communicate in a defined manner.

Common programming languages have the ability to describe behavior, but they do not offer concepts for describing hardware-specific characteristics. Therefore, an HDL should be able to describe interfaces by defining input and output signals and ports, which offers the ability to communicate with other modules. Structural description is used to keep an overview of complex designs; this is a kind of hierarchical abstraction. In addition to existing operators, operators for the register-transfer level and the logic level are needed. At the register-transfer level,

asynchronous behavior is also needed to describe signals like interrupts, reset, and set. For implementing the hardware, VHDL was chosen as hardware description language. Figure 3.2 shows a comparison between VHDL and Verilog; both languages have nearly the same capability to describe hardware. The VHDL code was written with the Emacs editor because of its excellent VHDL mode, which was one of the reasons why VHDL was chosen.

Figure 3.2: Comparison between VHDL and Verilog

It is also necessary to know how to write well-synthesizable code; some code examples are described in section 3.3.5 and in chapter 4. Hints for writing good synthesizable code for XILINX devices can be found in [xil04].

3.1.4 Simulation

Simulation is used to verify the functionality of the described hardware. Simulation can be done on different abstraction layers: simulation of the behavioral model and simulation of the gate-level model.

Simulation of Behavioral Models

Behavioral models are written in an editor or another front-end tool. With behavioral models it is possible to verify the design at an early stage of the project; they are also used as input for the synthesis tools. To verify the behavioral model, a test script is used. The test script provides stimuli for the input signals and compares the output signals with the expected output values. The test script must be able to read from and write to the interface of the behavioral model. Before the verification of the

functional correctness, the syntactical correctness has to be verified. The simulation is fast because it is time-discrete and level-discrete: there is no computation of transient signal transitions, and only three logic levels are used during the simulation process (high, low, and tristate).

Simulation of Gate-Level Models

The gate-level model is generated by synthesis tools making use of target-specific logic models. The simulation is used to compare the behavioral model with the gate-level model; the timing constraints of the gate-level model can also be checked. Simulating the gate-level model takes much more time than simulating the behavioral model.

Simulation of the RSA HDL Model

The simulation of the RSA chip was performed with ModelSim from Mentor Graphics. The test script was implemented in TCL. The test bench compares the intermediate results from the high-level model with the results of the behavioral model. Methods were therefore needed which are able to compare big numbers in TCL, because the range of an unsigned integer is limited to 4294967296 (32 bit). The simulation of one modular exponentiation takes about five minutes on an INTEL Pentium IV machine running at 1.6 GHz. For detailed analysis it is also possible to use a waveform viewer.

3.1.5 Synthesis

The aim of circuit synthesis and optimization is the generation of a technology-dependent netlist on the gate level, optimized under various criteria such as minimal area demand or maximum achievable clock frequency. Besides conventional netlists, HDL behavioral descriptions can also be used as input. The steps of synthesis and optimization are displayed in figure 3.3. The first step of synthesis is the transformation of the input, which can be a behavioral description of an HDL model or a netlist, into Boolean equations. If the input is a netlist on gate level, the extraction of Boolean equations requires the functionality of the associated technology library. Afterwards the Boolean equations are available in a structured form, meaning that an equation is generated for each output and a variable for each signal. If the input is an HDL behavioral description, the synthesis is done by linking known function blocks (after the syntax check) on register-transfer level. Since any digital hardware can be described by state machines, it is necessary to find out how many states are needed and which encoding suits them best. For example, one-hot encoding needs more hardware resources but is faster than binary encoding. An optimization is achieved by a favorable selection of the function blocks.

Figure 3.3: Hardware synthesis

For the following optimization steps, a block-oriented description is available in the form of Boolean equations. On the basis of these Boolean equations, the technology-independent logic optimization can be performed. The aim of this optimization is an improved structure of the equations, so that redundant logic can be removed in order to reduce the needed area and the time delay of the circuit. Boolean algebra is the basic reduction tool for these optimizations; other optimization functions are logic flattening and logic structuring. The resulting equations can also be understood as a technology-independent gate-level netlist. In the last step, a technology-dependent netlist is generated on the basis of a corresponding gate library; this procedure is called mapping. Then the place & route procedure is performed. Depending on the hardware design, it is possible that the routing requires additional cells of an FPGA. The goal of this process is to meet the constraints set by the developer and the design

rules defined in the technology library. The observance of the design rules has higher priority than the optimization goals. It is important to distinguish between place and route on FPGA platforms and other production processes like CMOS, because there is no need for a geometrical rule set on FPGA platforms. For the synthesis and the place-and-route procedure, the ISE development studio was used. It is only available for Xilinx devices, but it is possible to use third-party tools like SYNPLIFY for the synthesis. The whole procedure can also be executed from the command line, which offers an easy way to run nightly builds. Nightly builds are necessary because generating a bit stream can take up to four days on an AMD Athlon XP 2000+ or a 1.6 GHz INTEL Pentium IV CPU. The generation time can be reduced significantly by using more RAM, because the place-and-route procedure is very memory-demanding. On a 2.8 GHz INTEL Pentium IV system with two gigabytes of RAM it takes about four hours to generate a bit-stream file.

3.2 Software

Since programming languages are not hardware-independent, it is necessary to write software in such a manner that the dependent parts can be exchanged easily. During the implementation of software, many programming languages need to be used. So it is necessary to use well-defined interfaces on a dedicated abstraction layer to keep the software as platform-independent as possible. Choosing a programming language is not that difficult: it depends on the particular requirements of the application, the surrounding environment (interfaces), and personal preferences. For this application Java was chosen because

- a big-number representation is already included (the BigInteger class), which also provides basic algorithmic functions like addition, subtraction, multiplication, division, modular reduction, and modular exponentiation,
- a cryptographic interface is already implemented (JCE),
- it is easy to use with JSP,
- it is possible to access low-level C or C++ functions via the Java Native Interface (JNI),
- it is available on almost all platforms.

When writing software it is recommended to use a design pattern which satisfies all requirements of the application. Many software applications have a common functional behavior, so it is not necessary to redo all the development work. In this case the Layer Pattern is used; more information about patterns can be found in [Bus96].
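For example, the pure-software baseline against which the accelerated cipher can later be compared is essentially a call to BigInteger.modPow; the numbers below are tiny textbook values, not real RSA parameters:

import java.math.BigInteger;

public class SoftwareRsa {
    public static void main(String[] args) {
        BigInteger m = BigInteger.valueOf(3233);   // toy modulus (61 * 53)
        BigInteger e = BigInteger.valueOf(17);     // public exponent
        BigInteger d = BigInteger.valueOf(2753);   // matching private exponent
        BigInteger p = BigInteger.valueOf(65);     // "plaintext"

        BigInteger c = p.modPow(e, m);             // encrypt: 2790
        System.out.println(c.modPow(d, m));        // decrypt: prints 65
    }
}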

3.2.1 The Layer Pattern

The layer pattern is a so-called architectural pattern. There are many applications using the layer pattern, for example the ISO/OSI network model, the TCP/IP protocol stack, and nearly any Application Programming Interface (API). Figure 3.4 shows some examples of the layer pattern.

Figure 3.4: Examples of the layer pattern

The intention is to decompose the application into layers of abstraction. The solution is to group similar functions in one layer. Between two different layers there is always a standardized interface. This makes it possible to replace the implementation of a layer without changing other layers. Applications on a particular layer may only call functions of their subordinate layer(s). This may give the impression of slowing down the whole application, but in this case all function calls work in a call-back manner, so the only overhead is putting parameters on the stack.

3.2.2 Software Architecture

Figure 3.5 shows the design of the architecture and the programming languages used for each layer. If the application is to be ported to another system like Linux, it is only necessary to replace the layers between the JNI interface and the hardware. The JNI interface

Figure 3.5: Software architecture

also implements the interrupt service routine. Interrupts have an asynchronous behavior and therefore they must be synchronized with function calls from the upper layer. The implementation is shown in section 4.1.2.

3.3 Hardware

When implementing hardware, it is necessary to keep in mind which target environment the hardware is designed for. For example, a full-custom design is completely different from a design for FPGA devices. In full-custom designs the usage of registers is very expensive because of the needed area; when designing for an FPGA platform, registers are relatively cheap because each slice contains up to two flip-flops (see section 3.3.3). But there are also some common design rules, like separating the hardware into a control path and a data path.

3.3.1 Control Path and Data Path

Most hardware applications can be separated into a control path and a data path. The data path typically has some pipeline stages (register banks) and a huge block of combinational logic showing regular structures. The control path, on the other hand, is mostly implemented through state encoding and is therefore relatively small compared to the data path. The control path and the data path are connected through control signals and status signals, so the control unit also affects the delay in the data path. How to separate control and data path to minimize these effects is shown in section 3.3.6. The hardware implementation of the RSA algorithm was also separated into a control path and a data path, as shown in figure 3.6.

Figure 3.6: Separating hardware into control and data path

The detailed architecture is described in section 4.2.1.

3.3.2 Synchronizing Clock Domains

To communicate with other hardware, an interface to the surrounding environment is needed. In many cases a bus is used to connect hardware, and typically the bus frequency differs from the clock frequency of the device. Therefore, synchronization between the two clock domains is needed. First of all, each flip-flop has a typical setup and hold time; during this time span the input signal has to be stable, otherwise the flip-flop may drop into a meta-stable state. In a meta-stable state the

output has no definite logic level. Figure 3.7 shows how the input of a flip-flop is moved to the output.

Figure 3.7: Clock-to-output delay of flip-flops

Figure 3.8 shows the meta-stable state of a flip-flop when the input changes during the setup time tsu or hold time th.

Figure 3.8: Meta-stable states of flip-flops

To synchronize signals or data from one clock domain to another, it is necessary to ensure that the input does not change during the setup and hold time. There are three basic approaches to avoid meta-stable states:

- Using a double synchronizer. It consists of a 2-bit shift register which is clocked by the destination clock. This kind of synchronization should only be used for a single signal: when using this circuit for synchronizing a data bus, it is possible that one bit is captured a clock cycle before another bit, so the transferred data is no longer consistent. Figure 3.9 shows a signal synchronization circuit with two flip-flops connected as a shift register. Additional flip-flops can be added to the shift register to increase the probability of avoiding meta-stable states. Each additional flip-flop reduces the failure probability by a factor of roughly 1/1000, so the probability of a meta-stable state after two flip-flops is approximately 1/1000000.
- Using a handshake circuit. This circuit consists of a FIFO, additional handshake logic, and signals for request and acknowledge. It is very difficult to design such handshake circuits without decreasing the throughput. Therefore, if it is possible to ensure stable data during the setup and hold time of the flip-flop, these kinds of circuits are avoided.

Figure 3.9: Signal synchronization with double synchronizer

When using the PCI bus, additional logic is needed to prevent reading and writing at the same time, because this could destroy the I/O buffers of the PCI bus. The implementation and synchronization of the presented RSA circuit is shown in section 4.2.6.

3.3.3 Architecture of FPGAs

An FPGA is basically a reconfigurable hardware device. The actual functionality always depends on the loaded configuration file; when the power supply is switched off, the configuration is lost. Therefore, an FPGA always has a configuration interface, for example a serial or a parallel interface. This offers the possibility of fast hardware development.

Before writing hardware for FPGA devices, the architecture of the used FPGA should be known, since there are many possibilities to optimize a circuit for FPGA devices. In this project Xilinx FPGAs are used, but the basic FPGA architecture is similar on other FPGA devices. The typical layout of an FPGA consists of logic blocks surrounded by input/output blocks. There are many names for a logic block, for example Configurable Logic Block (CLB), Logic Module, Logic Cell, Logic Element (LE), Programmable Function Unit (PFU), Core Cell, or Tile. Other names for an input/output block are I/O Block (IOB), I/O Module, I/O Cell, I/O Element (IOE), and Programmable Input/Output Cell (PIC). As different as the names of logic blocks are, so different are their implementations. An important concept is granularity. High granularity means less combinational logic in one logic block. FPGAs with a high granularity make it easier for synthesis tools to map the logic equations to the hardware because they do not have to optimize for large logic blocks. The disadvantage of high granularity is that more signals are needed for routing, so the achievable clock rates decrease (see [BFRV92]). There exists no perfect granularity; which granularity is best depends on the application. There are three major strategies to implement a logic block:

- Lookup tables (LUT). LUTs are a kind of SRAM table where the address signals are the logical inputs and the addressed value is the logical output. The values of the SRAM table are set during the configuration of the FPGA.
- Multiplexers. The logical function is realized by multiplexers.
- Gates only. The gates are arranged in an array. It is possible to map equations with up to eight inputs.

Because of the different implementations of FPGAs it is difficult to compare them. Initially the comparison was done by the number of gates, where a gate is defined as a NAND module with two inputs. Nowadays this comparison does not make sense anymore because today's FPGAs are already equipped with built-in memory and small CPUs. Once again, the choice of the most suitable FPGA depends on the application to be implemented on it.
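The LUT principle is easy to mimic in software: a four-input LUT is just a 16-entry truth table addressed by the input bits, filled at configuration time. The Java sketch below is purely illustrative and configures the table as a majority function of the lower three inputs:

public class Lut4 {

    private final boolean[] table = new boolean[16];

    // The 16 configuration bits play the role of the SRAM contents.
    Lut4(int truthTable) {
        for (int i = 0; i < 16; i++) {
            table[i] = ((truthTable >> i) & 1) != 0;
        }
    }

    // The four inputs form the address of the table entry.
    boolean eval(boolean i3, boolean i2, boolean i1, boolean i0) {
        int addr = (i3 ? 8 : 0) | (i2 ? 4 : 0) | (i1 ? 2 : 0) | (i0 ? 1 : 0);
        return table[addr];
    }

    public static void main(String[] args) {
        // Majority of (i2, i1, i0), independent of i3.
        Lut4 majority = new Lut4(0b1110_1000_1110_1000);
        System.out.println(majority.eval(false, true, true, false)); // true
    }
}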

3.3.4 Xilinx FPGAs

The pictures and detailed information in this section are taken from the XILINX Spartan-II data sheet, which can be found under [xil05]. Figure 3.10 shows the architecture of a Xilinx Spartan II device: a huge array of CLBs surrounded by I/O buffers and DLLs. Figure 3.11 shows the combination of two slices in one CLB.

Figure 3.10: Architecture of Xilinx Spartan II

Xilinx uses LUTs to implement a logic block; it depends on the product family which and how many additional elements like multiplexers and registers are used. Each Xilinx FPGA has at least two LUTs (with four inputs each) per slice and two slices per CLB. Figure 3.12 shows the schematic of one slice of a Xilinx Spartan II FPGA. The major advantage of the LUT architecture is that the four-input LUT can also be used as single-port RAM (SPRAM) or as dual-port RAM (DPRAM). Therefore, it is possible to address 16 bits instead of saving just one bit with a single flip-flop. Each slice also has a fast carry line, which can be used to implement fast counters, adders, or subtractors; fast comparators are possible as well. There are two ways to use such built-in functions: the first is to instantiate hard macros from the tool library, the second is to write the VHDL code in such a manner that the synthesizer can map it to a built-in function. The Virtex II line has some additional interconnection circuits in each CLB, allowing short net delays to the surrounding CLBs. The Virtex IV family already has small built-in CPUs and Rocket IO buffers for transfer rates up to ten gigabits per second.

Figure 3.11: Xilinx Spartan II CLB implementation

3.3.5 Designing for FPGAs

With the underlying LUT technology, the best mapping can be achieved using logic circuits with four inputs and one output. When implementing a Boolean equation with only two inputs, a whole LUT is still used, so the remaining inputs cannot be used anymore; the utilization ratio is only 50 percent. Even with three inputs, the utilization ratio is only 75 percent. If it is possible to rewrite a Boolean equation to use a whole 4-input LUT, the synthesizer will map it accordingly during the optimization phase. The advantage of larger logic blocks is that no routing is required for the represented Boolean equation, so the input-to-output delay can be minimized. When designing complex circuits, it is essential to keep neighboring blocks together to minimize the routing delay. In VHDL this is achieved by grouping such functions in one entity; the synthesizer will then keep them next to each other. Whenever it is possible to map functionality onto built-in structures of the FPGA, this should be done. The following built-in functions are available from the Spartan II family up to the Virtex IV family:

- Single-port RAM
- Dual-port RAM
- Adders, subtractors

- Counters
- Comparators
- Block RAM
- Linear feedback shift registers
- Parallel-loadable shift registers

Figure 3.12: Xilinx Spartan II slice implementation

The maximum available word size of these built-in functions depends on the Xilinx family. More information about the built-in functions can be found in [xil04].

3.3.6 High-Performance Optimizations

This section describes some methods for maximizing the throughput of circuits.

Clock Frequency

A general rule for achieving high-performance designs is to maximize the clock frequency, which has a linear influence on the throughput of the circuit.

A high clock frequency can be reached by keeping combinational paths as short as possible. If there is a long combinational path in the circuit, it can be shortened by inserting intermediate registers. This method is called pipelining. The result is then delayed by several clock cycles, depending on how much pipelining is applied. This method is very well suited for FPGAs because a flip-flop is available after each combinational block. Too much pipelining can also have negative effects: first, counters are needed to find out when the result emerges from the pipeline; second, if the design is very large and a lot of routing is already needed, pipelining causes additional routing which will decrease the clock frequency. Therefore, it is important to find the right amount of pipelining.

Buffering Control Signals

If the design is split into a control path and a data path, each signal from the control unit to the data path should be buffered. The reason is the combinational logic of the state machine: if no buffering is used, the longest logical path of the state machine is added to that of the data path, which can decrease the clock frequency significantly. The only disadvantage of buffered signals between the state machine and the data path is that, if the state machine has to wait for status signals from the data path, it has to wait one additional clock cycle, because the control signal from the last clock cycle is still present on the data path.

Using Trees

Whenever possible, trees should be used to combine signals. These can be binary trees, Wallace trees, Dadda trees, and so on [Par00]. The major advantage of a tree is that the length of the logical path is equivalent to the height of the tree. This is the reason why such trees are often used in multiplier circuits. For example, if a binary tree is used, the depth of the tree is given by

height = log2(width).    (3.2)

So if the tree has a width of 8, the depth is 3; but if the tree has a width of 1024, the depth is only 10. In other words: when doubling the width, the depth increases only by one.
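A quick numeric check of this relation, together with the corresponding estimate for 3-to-2 (Wallace) reduction from chapter 2:

public class TreeDepth {

    static int binaryDepth(int width) {                // equation (3.2)
        return (int) Math.ceil(Math.log(width) / Math.log(2));
    }

    static int wallaceDepth(int operands) {            // h(n) ~ log1.5(n/2)
        return (int) Math.ceil(Math.log(operands / 2.0) / Math.log(1.5));
    }

    public static void main(String[] args) {
        System.out.println(binaryDepth(8));     // 3
        System.out.println(binaryDepth(1024));  // 10
        System.out.println(binaryDepth(2048));  // 11: doubling adds one level
        System.out.println(wallaceDepth(6));    // 3 levels for a 6-to-2 tree
    }
}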


Chapter 4 Implementation
This chapter describes important parts of the software and hardware implementation.

4.1 Software

As shown in section 5.1, the word sizes achieved on the Spartan II device are relatively small. Therefore, only an application which measures the throughput of the RSA implementation and compares it with the available software RSA implementation was written. The software can be executed on the command line or in combination with an Apache Tomcat server over an HTML site. To execute a Java program over an HTML site, a Java Server Pages (JSP) implementation is needed. To avoid problems with JSP, the used Java programs should always be part of a package; otherwise it is possible that the Tomcat server cannot find the needed methods. More information about the Jakarta project, which is the official reference implementation of JSP, can be found in [tom04]. Before it is possible to use the hardware-accelerated version of RSA, a JCE provider needs to be implemented.

4.1.1 JCE and JCA

With the Java Cryptography Extension it is possible to implement your own ciphers, key agreement schemes, and message authentication codes. All these features are encapsulated in a provider class. During this work a provider with the name myProv was implemented. This provider supports only one cipher with the algorithm name accRSA. To get an instance of this cipher and use it for encryption and decryption, the following code must be executed:

Cipher rsaCipher;

// Create the cipher
rsaCipher = Cipher.getInstance("accRSA", "myProv");

// Initialize the cipher for encryption
rsaCipher.init(Cipher.ENCRYPT_MODE, rsaKey);

// Our plaintext
byte[] plaintext = "I can read this!".getBytes();

// Encrypt the plaintext
byte[] ciphertext = rsaCipher.doFinal(plaintext);

// Initialize the same cipher for decryption
rsaCipher.init(Cipher.DECRYPT_MODE, rsaKey);

// Decrypt the ciphertext
byte[] decryptedtext = rsaCipher.doFinal(ciphertext);

More information on implementing and using the Java Cryptography Extension can be found in [jce04b, jce04a]. The throughput results are shown in section 5.2. As mentioned in section 3.2.2, C++ is used as the programming language to communicate with the RSA hardware implementation. Therefore, the Java program needs to call these methods to use the advantages of the RSA hardware implementation. This is done via the Java Native Interface (JNI), described in the next section.

4.1.2 JNI Interface

JNI allows Java to call pure C or C++ methods from a Java program. Methods which should call such C or C++ functions must be declared as public native methods. The following code snippet shows the definition of JNI methods on the Java side.

public class rsa_interface {

    public native void writeR2(byte[] value);
    ...
    public native boolean startComputation();
    public native byte[] readResult();

    static {
        System.load("c:\\rsa\\rsa.dll");
    }
}

Every time a new instance of this class is created, the dynamic link library (DLL) rsa.dll is loaded. All needed methods must be implemented in rsa.dll, otherwise the execution will cause an undefined reference error. From this Java file a C header file is generated using the javah command. Note that this header needs to be regenerated if the Java package name or any method name changes; then it is also necessary to recompile the used DLL, otherwise Java cannot execute the changed methods anymore. For more information on implementing the JNI interface see [Ste04]. The implementation of rsa.dll was done with the Microsoft Visual Studio .NET development environment. The computation time for a modular exponentiation depends on the used exponent, so it is necessary to know when the computation is finished. This can be done either by polling a status register or by waiting for an interrupt signal. Using interrupts is the better solution because no additional CPU time is wasted on requesting status information. Section 4.2.2 shows the used interrupt handshake mechanism and explains why the handshake mechanism is of great importance. The asynchronous behavior of an interrupt-service routine (ISR) must be synchronized in order to ensure the correct operation of the software. Synchronizing by waiting a fixed amount of time (the worst case of the computation time) is a bad solution, because if the computation finishes earlier, time is wasted. A mechanism is needed which does not waste CPU time during the wait state and returns immediately after the occurrence of an interrupt; if no interrupt occurs, an error should be reported. For this purpose it is appropriate to use a semaphore. When a process acquires the semaphore, it is suspended until another process or service routine releases it. While the process is suspended it does not waste any CPU time. Microsoft Windows offers the possibility to automatically release a process from the semaphore after a maximum amount of time. The following code shows the definition of the semaphore and the wait operation under Microsoft Windows.

HANDLE semaphore;

JNIEXPORT jboolean JNICALL Java_rsa_rsa_Interface_startComputation
  (JNIEnv *, jobject)
{
  // wait up to 100 milliseconds for the interrupt to occur,
  // otherwise report an error
  if (WaitForSingleObject(semaphore, 100L) == WAIT_OBJECT_0)
    return true;
  else
    return false;
}

The mechanism for releasing this semaphore is shown in section 4.1.3. For setting the time-out it is necessary to know the worst-case computation time. It can be calculated from the number of computation cycles needed for one multiplication. How

to calculate the number of computation cycles is shown in section 2.6.2; it is given by

n = ⌈(2 + wordsize(m)) / k⌉ + 1.    (4.1)

The word size of the prototype was 128 and the number of radix bits k was four. This leads to n = 34 computation cycles for one multiplication. Four additional cycles are needed for loading the operands and waiting for the result. Converting the result from carry-save representation to binary representation can take up to wordsize cycles. Therefore, 166 computation cycles are needed to compute one multiplication and retrieve the binary representation of the result. If each bit of the exponent is set and the square-and-multiply algorithm from section 2.5 is used, 128 * 2 = 256 multiplications are needed to compute the result. Another two multiplications are needed for the transformations to and from the Montgomery domain. This means one modular exponentiation needs 258 * 166 = 42828 clock cycles in the worst case. With a clock frequency of 40 MHz this takes 1.071 ms. A safety margin was used to define the time-out: if the interrupt does not occur within 100 ms, an error is reported. As shown in section 5.2, the average computation time is much smaller than the worst-case computation time.
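The arithmetic above can be reproduced directly. The sketch assumes the prototype parameters (word size 128, k = 4, 40 MHz) and equation (4.1) with the division rounded up:

public class WorstCaseTiming {
    public static void main(String[] args) {
        int w = 128, k = 4;                  // prototype word size and radix bits
        long fClk = 40_000_000L;             // 40 MHz

        int n = (w + 2 + k - 1) / k + 1;     // equation (4.1): 34 cycles
        int perMul = n + 4 + w;              // + load/result + conversion = 166
        int muls = 2 * w + 2;                // square-and-multiply + transforms = 258
        long cycles = (long) muls * perMul;  // 42828 cycles in the worst case

        System.out.println(cycles);                              // 42828
        System.out.printf("%.3f ms%n", cycles * 1000.0 / fClk);  // 1.071 ms
    }
}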

4.1.3 Device Driver

The Spartan II evaluation board from CESYS that was used to implement the prototype already includes a device driver for Windows. The driver is mainly written in C, but there also exists a C++ wrapper interface which eases integration. The driver, the documentation, and programming examples can be found in [ces04]. With this device driver it is also easy to implement an ISR: only the C++ class NotifyEvent needs to be derived and the method NotifyFunc(unsigned long IntContext) must be implemented.

class ComputationCompleteEvent : public NotifyEvent
{
public:
  ComputationCompleteEvent() {}
  ~ComputationCompleteEvent() {}
  void NotifyFunc(unsigned long IntContext);
};

void ComputationCompleteEvent::NotifyFunc(unsigned long IntContext)
{
  // an interrupt occurred -> release the semaphore
  ReleaseSemaphore(semaphore, 1, NULL);
}

The method for handling an interrupt is registered by the following code example:

ComputationCompleteEvent finishedEvent;
...
if (!fpga.StartInterruptCapture(&finishedEvent))
{
  fpga.Close();
  return false;
}

The device driver also has the ability to load a configuration file into the FPGA. This can only be done when the interrupt line of the FPGA board is disabled. The input data and the results are transferred over the PCI bus. The evaluation board uses only eight bits of the 32-bit PCI bus. In practice, the achieved transfer rates are between one and two megabytes per second. There are several methods to access the data bus of the FPGA board. The slowest methods are ReadByte and WriteByte, the fastest are ReadBlock and WriteBlock, because no additional setup cycles on the PCI bus are needed. When writing data to the RSA chip, it is necessary to pad each input to the used word size, otherwise the state machine of the RSA chip does not know when the input is finished. The conversion is done by zero padding.
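A zero-padding helper of the kind described could look as follows in Java; the big-endian byte order and the method name are assumptions for illustration:

import java.math.BigInteger;

public class Padding {

    // Pads a big-endian magnitude to exactly wordSize/8 bytes.
    static byte[] padToWordSize(BigInteger value, int wordSize) {
        byte[] raw = value.toByteArray();   // may contain a leading sign byte
        int len = wordSize / 8;
        byte[] out = new byte[len];
        int copy = Math.min(raw.length, len);
        // Copy the least significant bytes; the upper bytes stay zero.
        System.arraycopy(raw, raw.length - copy, out, len - copy, copy);
        return out;
    }

    public static void main(String[] args) {
        byte[] padded = padToWordSize(BigInteger.valueOf(255), 128);
        System.out.println(padded.length);       // 16
        System.out.println(padded[15] & 0xff);   // 255
    }
}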

4.2 Hardware

The following sections show the architecture of the RSA chip and explain its important parts. They also give the reasoning for important design decisions.

4.2.1 RSA Chip

The implemented architecture of the RSA chip is shown in figure 4.1. The dominating part is the Montgomery multiplier. It offers four operating modes, which are set by the state machine. All input and output values of the Montgomery multiplier are stored in registers to keep the combinational paths as short as possible. The operands of the multiplication are stored in the Operand A Reg and Operand B Reg. The Montgomery constant m is stored in the Montgomery Reg; it is only changed during initialization and stays constant during operation. The exponent is stored in the Exponent Reg. The intermediate result is stored in carry-save representation in the two registers Sum Reg and Carry Reg. The ripple-carry adder RCA is needed to convert the least significant bits of the carry-save representation of the result into binary representation. The Comparator is needed to find out whether the carry is already zero; the reason for this is explained in section 2.8.3. The Single Port RAM saves r2 for transforming the input value a to the Montgomery domain,

Figure 4.1: Architecture of the RSA chip

the value 1 for transforming the result from the Montgomery domain back to the Z domain, the transformed input ā, and the intermediate result which is needed by the square-and-multiply algorithm. The task of the Interface is to synchronize the different clock domains, read control commands and data from the PCI bus, and write the result to the PCI bus. The Control Unit reacts to control commands and controls the whole computation. The registers Operand B Reg and Exponent Reg are parallel-loadable shift registers.

4.2.2 Control Unit

The control unit manages reading and writing data between the interface and the PCI bus and executes the square-and-multiply algorithm. A state machine can be described as a Mealy or a Moore state machine; it is also possible to combine both in one state machine. The difference between Mealy and Moore state machines is the generation of the output function. In a Mealy state machine the outputs are a function of the current state and the primary inputs. A Moore state machine has outputs that are a function of the

current state only, and so includes outputs coming directly from the state register. If outputs come directly from the state registers only, there is no output logic. There are different state encoding formats; the most common are binary, Gray, one-hot, and custom representations.

Binary State Encoding

Each state is simply assigned an incrementing binary number. The number of registers needed for this state encoding is log2 of the total number of states, which is the minimum number of registers.

Gray and Johnson State Encoding

In both Gray and Johnson state encoding, each state is assigned a successive binary number where only one bit changes from one state to the next. The primary motive for such coding is to reduce the possibility of state transition errors caused by asynchronous inputs changing during flip-flop setup times. Johnson encoding needs more flip-flops for state encoding. To avoid huge next-state logic, an asynchronous reset is preferred; otherwise all 2^n binary numbers must be used for encoding, resulting in a larger circuit.

One-Hot State Encoding

In one-hot state encoding, each state is assigned its own flip-flop, so n states require n flip-flops and only one flip-flop is in its true state at a time. This increased number of flip-flops usually results in a larger circuit, but the combinational logic to calculate the next state is very small, making this a very fast encoding.

Custom Representation

The above-mentioned formats are chosen by the synthesis tool to minimize next-state logic or to minimize combinational logic. Most synthesis tools offer a configuration parameter which defines the preferred state encoding. The state encoding formats are compared in table 4.1.

No.  Binary  Gray  Johnson  One-Hot
0    000     000   0000     00000001
1    001     001   0001     00000010
2    010     011   0011     00000100
3    011     010   0111     00001000
4    100     110   1111     00010000
5    101     111   1110     00100000
6    110     101   1100     01000000
7    111     100   1000     10000000

Table 4.1: State encoding formats

Describing State Machines in VHDL

To let the synthesis tool optimize the state machine, its behavior must be described in a particular way. In VHDL, state machines consist of combinational next-state logic, a sequential state switcher, and a user-defined type which includes all possible states.

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity statemachine is
  port (
    clk, reset, rd, wr : in  std_ulogic;
    mode               : out std_ulogic_vector(1 downto 0));
end statemachine;

-------------------------------------------------------------
architecture behavior of statemachine is
  type STATE_TYPE is (IDLE, READ, WRITE);
  signal current_state, next_state : STATE_TYPE;
begin  -- behavior

  comb : process(current_state, rd, wr)
  begin
    mode <= "00";
    case (current_state) is
      when IDLE =>
        if rd = '1' then
          next_state <= READ;
        elsif wr = '1' then
          next_state <= WRITE;
        else

          next_state <= IDLE;
        end if;
      when READ =>
        mode       <= "01";
        next_state <= IDLE;
      when WRITE =>
        mode       <= "10";
        next_state <= IDLE;
      when others =>
        next_state <= IDLE;
    end case;
  end process comb;

  sync : process(clk, reset)
  begin
    if reset = '1' then
      current_state <= IDLE;
    elsif clk'event and clk = '1' then
      current_state <= next_state;
    end if;
  end process;

end behavior;

When synthesizing this code, the following result is generated:

=================================================================
* HDL Synthesis *
=================================================================
Synthesizing Unit <statemachine>.
    Related source file is C:/Xilinx/designs/statemachine.vhdl.
    Found finite state machine <FSM_0> for signal <current_state>.
    -------------------------------------------------------
    | States         | 3                 |
    | Transitions    | 5                 |
    | Inputs         | 2                 |
    | Outputs        | 2                 |
    | Clock          | clk (rising_edge) |
    | Reset          | reset (positive)  |
    | Reset type     | asynchronous      |
    | Reset State    | idle              |
    | Power Up State | idle              |
    | Encoding       | automatic         |
    | Implementation | LUT               |

    -------------------------------------------------------
Summary:
    inferred 1 Finite State Machine(s).
Unit <statemachine> synthesized.

=================================================================
* Advanced HDL Synthesis *
=================================================================
Advanced RAM inference ...
Advanced multiplier inference ...
Advanced Registered AddSub inference ...
Selecting encoding for FSM_0 ...
Optimizing FSM <FSM_0> on signal <current_state> with one-hot encoding.
Dynamic shift register inference ...

In this case a state machine with three states (IDLE, READ, WRITE), five transitions, and two outputs is generated. The used encoding format is one-hot encoding.

Architecture of the Control Unit

The control unit handles the data exchange through the interface and runs the square-and-multiply exponentiation algorithm. A state machine with additional registers to buffer the output signals is used as control unit. This minimizes the side effects on the timing of the data path, as explained in section 3.3.6. Figure 4.2 shows the architecture of the control unit. The control unit uses three counters for its main tasks: the first counter is needed to know when a read or write transaction is finished, the second counter indicates a valid result after a multiplication, and the third counter is used to find out when the exponentiation has finished. The main state diagram of the control unit is shown in figure 4.3. The RSA chip accepts control commands only in the IDLE state; therefore, running operations cannot be interrupted. If the rd signal of the control unit is high and the data received from the PCI bus was a control command, the next state depends on the value of the control command. When the next state is a READ or WRITE state, the state machine waits until the whole word is processed and then returns to the IDLE state. Every time a byte is read from the interface, the word counter is incremented. If the computation is to be started, the state machine moves from the IDLE state to the COMPUTE state.

Figure 4.2: Architecture of the control unit

The Computation State

This is the most complex part of the control unit; the main parts of the square-and-multiply algorithm are processed here. In the first part, the input value a is transformed into ā in the Montgomery domain and written back into the single-port RAM. As explained in section 2.6.1, the transformation is done by a multiplication by r2. The multiplication is done when the overflow signal of the multiply counter is true. The result of each multiplication is in carry-save representation, so it must be transformed to the equivalent binary representation (see section 2.8.3). Before starting the square-and-multiply procedure, the MSB of the exponent must be detected. Afterwards, in each round the intermediate result is squared; a multiplication by ā is only done when the corresponding exponent bit is set (see section 2.5). Again, after each multiplication the conversion from carry-save representation to binary representation must be applied. When the whole exponent has been processed, the irq signal is set to 1. The next section describes the used irq handshake.

IRQ Handshake

To ensure that the host CPU receives the interrupt request, a handshake mechanism is used. The irq signal stays high until the cs1 signal goes low. The cs1 signal is triggered by the CPU. When cs1 goes low, the irq signal must be set to zero immediately, otherwise the interrupt-service routine could be activated a second time. Figure 4.4 shows the irq handshake circuit.

Figure 4.3: State diagram of the control unit

Figure 4.4: IRQ handshake circuit

The signal cs1_n is buffered to reduce the probability of meta-stable states. Only

one flip-flop is used, because otherwise it would take too long to reset the irq signal. The send_irq signal is set by the control unit.
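The sequence executed by this state machine corresponds, in the high-level Java model, to the following left-to-right square-and-multiply sketch. For readability it uses plain modular multiplication; in the hardware, every multiplication is a Montgomery multiplication (a bit-serial software stand-in is sketched in the next section), framed by the transformations to and from the Montgomery domain:

import java.math.BigInteger;

public class SquareAndMultiply {

    // Left-to-right square-and-multiply; assumes e >= 1.
    static BigInteger modPow(BigInteger a, BigInteger e, BigInteger m) {
        BigInteger p = a;                        // consumes the MSB of e
        for (int i = e.bitLength() - 2; i >= 0; i--) {
            p = p.multiply(p).mod(m);            // square every round
            if (e.testBit(i)) {
                p = p.multiply(a).mod(m);        // multiply when bit is set
            }
        }
        return p;
    }

    public static void main(String[] args) {
        BigInteger m = BigInteger.valueOf(3233);
        BigInteger r = modPow(BigInteger.valueOf(65), BigInteger.valueOf(17), m);
        System.out.println(r);                                       // 2790
        System.out.println(BigInteger.valueOf(65)
                .modPow(BigInteger.valueOf(17), m));                 // 2790
    }
}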

4.2.3 Montgomery Multiplier

The Montgomery multiplier is the biggest part of the architecture; it needs about 80% of the available FPGA cells. Depending on the mode signal, four different operations are performed. Table 4.2 shows all available modes of operation.

mode  Description
00    Pass through of sum and carry
01    Right shift of sum and carry
10    Shift in of interface data
11    Shift out data to the interface

Table 4.2: Operation modes of the Montgomery multiplier

The pass-through mode is used for converting the carry-save representation into the binary representation. The control unit has to ensure that all inputs except the sum and carry inputs are set to zero. The right shifting of the sum and carry inputs is used during the computation, because algorithm 7 needs the term p/2^k. The third mode is used to load input data: the FPGA evaluation board only provides an 8-bit wide PCI interface, so all input bytes must be concatenated by shifting. The last mode is used to transfer the lowest eight bits to the interface and shift them right afterwards. The architecture of the Montgomery multiplier is shown in figure 4.5. The Montgomery multiplier consists of two high-radix multipliers which compute the products q·m and bi·a. The Select & Shift unit is used to calculate the term p/2^k. The results of the high-radix multipliers are already in carry-save representation, and the input and output of the Select & Shift unit are in carry-save representation as well. So a six-operand Wallace tree is used to form the result p' = p/2^k + q·m + bi·a, which is the result of one iteration of algorithm 7. The control unit has to ensure that the right inputs are applied at the right time.
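For reference, the sketch below is a standard bit-serial (radix-2) software formulation of the Montgomery multiplication, not a transcription of algorithm 7: the hardware processes k bits of b per cycle and keeps p in carry-save form, but the structure of each iteration — add a multiple of a, add a multiple of m to clear the low bits, shift right — is the same:

import java.math.BigInteger;

public class MontMul {

    // Computes a * b * 2^(-n) mod m for an odd n-bit modulus m,
    // with 0 <= a, b < m.
    static BigInteger montMul(BigInteger a, BigInteger b,
                              BigInteger m, int n) {
        BigInteger p = BigInteger.ZERO;
        for (int i = 0; i < n; i++) {
            if (b.testBit(i)) {
                p = p.add(a);             // p += b_i * a
            }
            if (p.testBit(0)) {
                p = p.add(m);             // q = p mod 2; p += q * m
            }
            p = p.shiftRight(1);          // p /= 2 (the Select & Shift step)
        }
        return p.compareTo(m) >= 0 ? p.subtract(m) : p;
    }

    public static void main(String[] args) {
        BigInteger m = BigInteger.valueOf(97);       // odd modulus, n = 7 bits
        BigInteger r = BigInteger.ONE.shiftLeft(7);  // r = 2^7
        BigInteger x = BigInteger.valueOf(42);
        BigInteger y = BigInteger.valueOf(55);
        BigInteger expected = x.multiply(y).multiply(r.modInverse(m)).mod(m);
        System.out.println(montMul(x, y, m, 7).equals(expected)); // true
    }
}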

4.2.4 XILINX Single-Port RAM

The XILINX distributed SelectRAM+ allows configuring a 16x1 single-port RAM cell within one slice; this means that the needed space is the same as for a 1-bit register. So saving 16 different 1-bit values in a SelectRAM+ instance requires the same number of slices as saving one bit in a register. The disadvantage is that the access to a particular value is about ten times slower than the access to a register, so it should only be used for saving constants and intermediate results. Reading a value from the SPRAM during a write operation may cause undefined output values; the control unit has to prevent reading and writing at the same time. The following

Figure 4.5: Architecture of the Montgomery multiplier

code shows how to describe a single-port RAM in VHDL. The behavioral description style assures platform independence; nevertheless, XILINX synthesizers recognize it as SelectRAM+.

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity xilinx_single_port_ram is
  generic (
    data_bits : integer := 128;
    addr_bits : integer := 4);
  port (
    clk, load : in  std_ulogic;
    addr      : in  std_ulogic_vector((addr_bits-1) downto 0);
    data_in   : in  std_ulogic_vector((data_bits-1) downto 0);
    data_out  : out std_ulogic_vector((data_bits-1) downto 0));
end xilinx_single_port_ram;

----------------------------------------------------------------
architecture behavior of xilinx_single_port_ram is
  type ram_type is array((2 ** addr_bits)-1 downto 0) of
    std_ulogic_vector(data_bits-1 downto 0);

  signal ram : ram_type;
begin

  process(clk, load, addr)
  begin
    if clk'event and clk = '1' then
      if load = '1' then
        ram(conv_integer(unsigned(addr))) <= data_in;
      end if;
    end if;
  end process;

  data_out <= ram(conv_integer(unsigned(addr)));

end behavior;

The RSA implementation uses an address width of two bits, so four different values can be stored in the single-port RAM (SPRAM):

- At address 00, r2 is stored, which is needed for the transformation of integers from the Z domain to the Montgomery domain.
- At address 01, the number one is stored, which is needed for the transformation from the Montgomery domain back to the Z domain.
- At address 10, a is stored; it is replaced with ā after the transformation to the Montgomery domain.
- At address 11, the result of one Montgomery multiplication is stored. It is overwritten after the next multiplication finishes. The value is needed during the square operation of algorithm 7.

4.2.5 Parallel-Loadable Shift Register

It is also possible to describe a parallel-loadable shift register in such a manner that the number of used slices and the achievable speed are optimized. The following VHDL code describes a shift register with configurable word size and shift width. For the RSA implementation, a synchronous reset is needed as well.

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity xilinx_shift_reg is
  generic (
    width       : integer := 128;

    shift_width : integer := 8);
  port (
    load, shift : in  std_ulogic;
    clk, reset  : in  std_ulogic;
    sync_reset  : in  std_ulogic;
    data_in     : in  std_ulogic_vector(width-1 downto 0);
    data_out    : out std_ulogic_vector(shift_width-1 downto 0));
end xilinx_shift_reg;

------------------------------------------------------------------
architecture behavior of xilinx_shift_reg is
  signal regs  : std_ulogic_vector(width-1 downto 0);
  signal zeros : std_ulogic_vector(shift_width-1 downto 0);
begin

  zeros <= (others => '0');

  process (clk, reset)
  begin  -- process shift
    if reset = '1' then
      regs <= (others => '0');
    elsif clk'event and clk = '1' then  -- rising clock edge
      if sync_reset = '1' then
        regs <= (others => '0');
      elsif load = '1' then
        if shift = '1' then
          regs <= zeros & regs(width-1 downto shift_width);
        else
          regs <= data_in;
        end if;
      end if;
    end if;
  end process;

  data_out <= regs(shift_width-1 downto 0);

end behavior;

4.2.6 PCI Interface

The main task of the PCI interface is to synchronize the different clock domains of the PCI bus and the FPGA clock. The read and write signals need to be buffered by flip-flops to reduce the probability of meta-stable states. Figure 4.6 shows the timing diagram of a write transaction and figure 4.7 shows the timing diagram of a read transaction. The diagrams are taken from the hardware documentation of the CESYS evaluation board [ces04].

PCI Write Transaction

Figure 4.6: Timing diagram of the PCI write transaction

When WR is high and CS2 is low, the data can be read safely from the PCI bus, because the data has been stable for two clock cycles. The clock frequency of the FPGA must be higher than the clock frequency of the PCI bus, otherwise it is possible to miss the stable period.

PCI Read Transaction

When RD is low and CS2 is low, data can be written to the bus. The data must stay constant until the RD signal goes high. Since tri-state drivers are used to put the data onto the PCI bus, the interface must ensure that no data is driven onto the PCI bus while the PCI bus itself tries to write data to the device, because this could destroy the PCI bus drivers. Figure 4.8 shows the circuit of the interface of the RSA chip.

Figure 4.7: Timing diagram of the PCI read transaction

4.2.7 High-Radix Multiplier

The high-radix multiplier operates at full word size. A Wallace tree is used to perform the multi-operand addition, and the result of the multiplication is represented in carry-save form. Figure 4.9 shows the architecture of the high-radix multiplier. The inputs of the multiplier are represented in binary form. It would be easy to build a high-radix multiplier whose inputs are represented in carry-save form as well: this is done by simply doubling the circuit, but it also means that the needed area is doubled. The advantage would be that there is no need for converting intermediate results into binary representation, which would save clock cycles. In all likelihood this would save only about four clock cycles per multiplication, while the hardware is nearly doubled because of the bigger multipliers and the additionally needed registers. So the achievable word size would be halved for a circuit occupying the same area.

Figure 4.8: Schematic of the interface

4.2.8 Wallace Tree

The Wallace tree is one of the most frequently used parts of the whole RSA architecture. The VHDL code is hard to read because of the complexity of a parameterizable implementation. The following VHDL code shows a parameterizable version of the Wallace tree.

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_arith.all;

entity wallace_tree is
  generic (
    operands : integer := 12;
    width    : integer := 1024);

Figure 4.9: Architecture of the high-radix multiplier

  port (
    input      : in  std_ulogic_vector(operands*width-1 downto 0);
    carry_in   : in  std_ulogic_vector(operands-3 downto 0);
    sum, carry : out std_ulogic_vector(width-1 downto 0));
end wallace_tree;

----------------------------------------------------------------------
architecture structure of wallace_tree is

  component carry_save_full_adder
    generic (
      width : integer);
    port (
      a, b, c    : in  std_ulogic_vector(width-1 downto 0);
      carry_in   : in  std_ulogic;
      sum, carry : out std_ulogic_vector(width-1 downto 0));
  end component;

  signal data : std_ulogic_vector(((2+3*(operands-2))*width)-1 downto 0);

begin

  wallace : for count in 0 to (operands-2)-1 generate
    csa_i : carry_save_full_adder
      generic map (
        width => width)
      port map (

        a        => data((2+count*3+3)*width-1 downto (2+count*3+2)*width),
        b        => data((2+count*3+2)*width-1 downto (2+count*3+1)*width),
        c        => data((2+count*3+1)*width-1 downto (2+count*3)*width),
        carry_in => carry_in(count),
        sum      => data((count*2+2)*width-1 downto (count*2+1)*width),
        carry    => data((count*2+1)*width-1 downto count*2*width));
  end generate wallace;

  inputs : for i in 0 to operands-1 generate
    data((2*(operands-2)+i+1)*width-1 downto (2*(operands-2)+i)*width)
      <= input((i+1)*width-1 downto i*width);
  end generate inputs;

  output : sum <= data((2*width-1) downto width);
  carry <= data((width-1) downto 0);

end structure;

The Wallace tree is generated from CSA cells. The only disadvantage of a CSA cell is its mapping to the four-input LUT, because one CSA cell has three inputs and two outputs. On XILINX FPGAs, very fast ripple-carry adders can be implemented, as shown in section 4.2.10. This leads to a modified version of the Wallace tree.

4.2.9 Modified Wallace Tree

The idea behind the modified Wallace tree is to exchange the CSA cells for ripple-carry adders. As shown in the next section, this only makes sense for adder widths of four bits or more. The advantage is that the needed area is about 60% of that of the Wallace tree, but the achieved clock frequency is lower because of the longer combinational paths. Another problem is that sometimes a carry decompression circuit is needed, because the carry output cannot be used directly as an adder input due to the different word sizes. However, the modification was implemented successfully. Higher radices were achieved, but the achieved clock frequency was less than half; therefore, the total throughput was lower.

4.2.10 Ripple-Carry Adder

Implementing fast adders on FPGA platforms differs from implementing fast adders on CMOS, especially if the needed word size is lower than 128 bits. The advantage of using a ripple-carry adder on an FPGA is that no routing delay occurs within the adder. So a 128-bit ripple-carry adder is faster than a 128-bit carry-look-ahead adder on FPGA platforms. The following VHDL code describes how to implement a parameterizable ripple-carry adder without internal routing delay.

library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity xilinx_ripple_carry_adder is
  generic (
    width : integer := 24);
  port (
    a, b  : in  std_ulogic_vector(width-1 downto 0);
    c_in  : in  std_ulogic;
    sum   : out std_ulogic_vector(width-1 downto 0);
    c_out : out std_ulogic);
end xilinx_ripple_carry_adder;

----------------------------------------------------------------
architecture behavior of xilinx_ripple_carry_adder is
  signal tmp_sum : std_logic_vector(width downto 0);
  signal zeros   : std_logic_vector(width-1 downto 0);
begin
  zeros   <= (others => '0');
  tmp_sum <= ('0' & std_logic_vector(a)) + ('0' & std_logic_vector(b))
             + (zeros & std_logic(c_in));
  sum     <= std_ulogic_vector(tmp_sum(width-1 downto 0));
  c_out   <= std_ulogic(tmp_sum(width));
end behavior;

For a 16-bit instance, the synthesis tool from XILINX produces the following timing report:

Timing constraint: Default path analysis
  Delay:       11.418ns (Levels of Logic = 19)
  Source:      a<0> (PAD)
  Destination: sum<15> (PAD)

  Data Path: a<0> to sum<15>
                          Gate    Net
  Cell:in->out   fanout   Delay   Delay   Logical Name
  ------------------------------------------------------
  IBUF:I->O         1     0.924   1.150   a_0_IBUF
  LUT2_L:I0->LO     1     0.653   0.000   sum<0>lut
  MUXCY:S->O        1     0.784   0.000   sum<0>cy
  MUXCY:CI->O       1     0.050   0.000   sum<1>cy
  MUXCY:CI->O       1     0.050   0.000   sum<2>cy
  MUXCY:CI->O       1     0.050   0.000   sum<3>cy
  MUXCY:CI->O       1     0.050   0.000   sum<4>cy
  MUXCY:CI->O       1     0.050   0.000   sum<5>cy
  MUXCY:CI->O       1     0.050   0.000   sum<6>cy
  MUXCY:CI->O       1     0.050   0.000   sum<7>cy
  MUXCY:CI->O       1     0.050   0.000   sum<8>cy
  MUXCY:CI->O       1     0.050   0.000   sum<9>cy
  MUXCY:CI->O       1     0.050   0.000   sum<10>cy
  MUXCY:CI->O       1     0.050   0.000   sum<11>cy
  MUXCY:CI->O       1     0.050   0.000   sum<12>cy
  MUXCY:CI->O       1     0.050   0.000   sum<13>cy
  MUXCY:CI->O       1     0.050   0.000   sum<14>cy
  XORCY:CI->O       1     0.500   1.150   sum<15>_xor
  OBUF:I->O               5.557           sum_15_OBUF
  ------------------------------------------------------
  Total                  11.418ns (9.118ns logic, 2.300ns route)
                                  (79.9% logic, 20.1% route)

Without the input and output buffers, the delay of the 16-bit ripple-carry adder is only 3.787 ns, which means the adder itself could operate at a clock frequency of roughly 264 MHz.
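The 3.787 ns value follows directly from the report: subtracting the input buffer (gate plus net delay) and the output buffer from the total path delay leaves the bare adder, and inverting this delay gives the maximum clock rate:

$$ 11.418 - (0.924 + 1.150) - 5.557 = 3.787\ \mathrm{ns}, \qquad f_{\max} = \frac{1}{3.787\ \mathrm{ns}} \approx 264\ \mathrm{MHz}. $$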


Chapter 5 Results
This chapter discusses the achieved results. It includes the results measured by the Java application as well as the throughput results for the XILINX Virtex II line. At the end of the chapter, the results are compared with related work.

5.1 Word Size and Radix

Table 5.1 shows the achieved word sizes with the corresponding radix on a XILINX Spartan II FPGA. The device utilization was 99.9%, which means that after routing only two slices were left. The bigger the implemented circuit gets, the more performance is lost to routing. For example, with a word size of 128 and a radix of 16, synthesis for a Xc2s200 (XILINX Spartan II) device indicates a clock frequency of 58.73 MHz; this value could be improved during routing to 70.572 MHz. If a word size of 1024 and a radix of 12 is used, the synthesis result for a Xc2v3000 (XILINX Virtex II) device promises a clock frequency of 148.35 MHz, while the maximum achievable clock frequency after routing is only 109.48 MHz. This is a performance loss of about 26%. The results for the XILINX Virtex II line are shown in tables 5.2, 5.3, 5.4, and 5.5.

Word size [bit]   ld(Radix) [bit]   Clock frequency [MHz]
      128                4                 70.572
      136                4                 70.156
      160                3                 75.431
      208                2                 80.232
      288                1                 85.229

Table 5.1: XILINX Xc2s200 synthesis results

Comparing the results of the XILINX Virtex II line shows that the achieved clock frequency depends mostly on the radix used. The influence of the word size is minimal.

The routing delays increase only slightly. Comparing the throughput, the radix, and the achieved clock frequency across the XILINX Virtex II line shows that a higher radix improves the throughput even though the clock frequency drops. This is an important property for a scalable design. The scalability stems mainly from the use of Wallace trees and carry-save adders: the critical path of a Wallace tree does not grow linearly with the number of inputs (a short bound illustrating this is given after the tables).

Word size [bit]   ld(Radix) [bit]   Clock frequency [MHz]
      512                8                 74.694
     1024                3                109.649
     2048                1                132.434

Table 5.2: XILINX Xc2v3000 synthesis results

Word size [bit]   ld(Radix) [bit]   Clock frequency [MHz]
      512               14                 59.494
     1024                6                 82.168
     2048                2                118.702

Table 5.3: XILINX Xc2v4000 synthesis results

Word size [bit]   ld(Radix) [bit]   Clock frequency [MHz]
      512               22                 53.414
     1024                9                 65.854
     2048                4                 94.686

Table 5.4: XILINX Xc2v6000 synthesis results

Word size [bit]   ld(Radix) [bit]   Clock frequency [MHz]
      512               32                 50.374
     1024               13                 60.143
     2048                6                 81.833

Table 5.5: XILINX Xc2v8000 synthesis results
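This sub-linear growth can be quantified with a back-of-the-envelope bound (an idealized textbook estimate, not a measurement of this design): each level of 3:2 carry-save adders reduces the operand count by a factor of 3/2, so the number of levels on the critical path is

$$ L(n) = \left\lceil \log_{3/2} \frac{n}{2} \right\rceil, $$

which gives $L(12) = 5$ and $L(24) = 7$. Doubling the number of operands therefore adds only about two CSA levels to the critical path.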

5.2 Performance and Throughput

With the values from Table 5.1 the maximum throughput can be calculated. The number of cycles for one multiplication is given by

$$ n = \left\lceil \frac{2 + \mathrm{wordsize}(m)}{k} \right\rceil + 1, \qquad (5.1) $$

where $k = \mathrm{ld}(\mathrm{radix})$ is the number of multiplier bits processed per clock cycle. Additionally, $\log_2(\mathrm{wordsize}(m))$ cycles are needed for converting the result from carry-save representation into binary representation. Loading the operands needs two more clock cycles, and after the multiplication another two clock cycles are required because of the buffered control signals. The average number of multiplications needed is given by

$$ c = \mathrm{wordsize}(e) \cdot 1.5 + 2. \qquad (5.2) $$

The factor 1.5 assumes that the binary representation of the exponent contains as many 0s as 1s. The addition of two accounts for the transformations into and out of the Montgomery domain. If the F4 exponent ($2^{16} + 1$) is used, this reduces to $c = 17 + 2 = 19$, which is independent of the word size. The throughput for average exponents can be calculated as

$$ t = \frac{\mathrm{wordsize}(m) \cdot f}{c \cdot \left( n + 4 + \log_2(\mathrm{wordsize}(m)) \right)}, \qquad (5.3) $$

where $f$ is the clock frequency of the circuit. The throughput for the F4 exponent can be calculated as

$$ t = \frac{\mathrm{wordsize}(m) \cdot f}{c \cdot \left( n + 4 + \log_2(\mathrm{wordsize}(m)) \right) + \mathrm{wordsize}(m) - 16}. \qquad (5.4) $$

The term $\mathrm{wordsize}(m) - 16$ equals the number of cycles needed to find the most significant bit of the exponent. Table 5.6 shows the achieved throughputs on the XILINX Spartan II FPGA.

Word size [bit]  ld(Radix) [bit]  Clock freq. [MHz]  Cycles per mult.  Throughput [Mb/s]  Throughput F4 [Mb/s]
      128              4              70.572               45               1.034                9.336
      136              4              70.156               48               0.965                9.245
      160              3              75.431               67               0.744                8.517
      208              2              80.232              118               0.450                6.856
      288              1              85.229              304               0.186                4.059

Table 5.6: XILINX Xc2s200 throughput results
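As a worked check of (5.1)-(5.4), take the 160-bit row of Table 5.6 ($k = 3$, $f = 75.431$ MHz). With $c = 160 \cdot 1.5 + 2 = 242$ for an average exponent and $c = 19$ for F4:

$$ n = \left\lceil \frac{2 + 160}{3} \right\rceil + 1 = 55, \qquad 55 + 4 + \lceil \log_2 160 \rceil = 67 \ \text{cycles per multiplication}, $$

$$ t = \frac{160 \cdot 75.431 \cdot 10^6}{242 \cdot 67} \approx 0.744\ \mathrm{Mb/s}, \qquad t_{F4} = \frac{160 \cdot 75.431 \cdot 10^6}{19 \cdot 67 + (160 - 16)} \approx 8.517\ \mathrm{Mb/s}, $$

both matching the table.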

Table 5.6: XILINX Xc2s200 throughput results The PCI interface of the CESYS evaluation board has only a throughput 1-2 MB/s, which is in the range of circuits throughput when using the F4 exponent. This means an over all performance loss when accessing the RSA chip. When using a standard PCI interface this should be no problem because the PCI bus has a maximum throughput of 133 MB/s. 68

To track down the bottlenecks of the whole application, the computation time was measured in the implementation of the JNI interface, and the overall computation time was measured in the Java application. To eliminate statistical outliers, 100,000 operations were performed. The device under test uses a word size of 128, a radix of $2^4$, and a clock frequency of 40 MHz. Using $c = 128 \cdot 1.5 + 2 = 194$ from (5.2), the calculated average computation time is

$$ t = \frac{194 \cdot \left( \left\lceil \frac{2 + 128}{4} \right\rceil + 1 + 4 + \log_2 128 \right)}{40 \cdot 10^6} = \frac{8730}{40 \cdot 10^6} = 218.25\ \mu\mathrm{s}. \qquad (5.5) $$

The mean value measured in the JNI implementation is 300 µs, a performance loss of 28%. This is mainly caused by the slow interface of the CESYS evaluation board, but the device driver may also cost performance, because it does not employ the latest technology. When the RSA acceleration card was used, the Java application needed 35,126 ms for 100,000 operations; thus the average computation time is 351.26 µs, and the overall performance loss is 37.83%. Tables 5.7, 5.8, 5.9, and 5.10 show the throughput results for the XILINX Virtex II line.

Word size [bit]  ld(Radix) [bit]  Clock freq. [MHz]  Cycles per mult.  Thr. [Mb/s]  Thr. F4 [Mb/s]
      512              8              74.694                79             0.629        19.150
     1024              3             109.649               357             0.204        14.412
     2048              1             132.434              2066             0.043         6.569

Table 5.7: XILINX Xc2v3000 throughput results

Word size [bit]  ld(Radix) [bit]  Clock freq. [MHz]  Cycles per mult.  Thr. [Mb/s]  Thr. F4 [Mb/s]
      512             14              59.494                51             0.776        20.792
     1024              6              82.168               186             0.294        18.525
     2048              2             118.702              1041             0.076        11.146

Table 5.8: XILINX Xc2v4000 throughput results

Word size [bit]  ld(Radix) [bit]  Clock freq. [MHz]  Cycles per mult.  Thr. [Mb/s]  Thr. F4 [Mb/s]
      512             22              53.414                38             0.935        22.453
     1024              9              65.854               129             0.339        19.495
     2048              4              94.686               529             0.119        16.049

Table 5.9: XILINX Xc2v6000 throughput results

Word size [bit]  ld(Radix) [bit]  Clock freq. [MHz]  Cycles per mult.  Thr. [Mb/s]  Thr. F4 [Mb/s]
      512             32              50.374                31             1.088        23.941
     1024             13              60.143                94             0.426        22.042
     2048              6              81.833               358             0.152        18.971

Table 5.10: XILINX Xc2v8000 throughput results

This means that on average up to 416 1024-bit signatures can be processed per second; with the F4 exponent, 21,525 1024-bit signatures can be processed per second. This example shows the strong dependence of the computation time on the exponent used.
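The signature rates follow directly from the 1024-bit row of Table 5.10:

$$ \frac{0.426 \cdot 10^6}{1024} \approx 416\ \mathrm{signatures/s}, \qquad \frac{22.042 \cdot 10^6}{1024} \approx 21{,}525\ \mathrm{signatures/s}. $$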

5.3 Comparison with other Implementations

In this section the achieved results are compared with other implementations. The FPGA implementation of Blum and Paar is described in [BP99]. Orup's implementation [Oru95] was only a conceptual design. As a software reference, the JCE implementation of the IAIK was also used [iai05]. The implementations are compared by their throughput rates. Both CPUs, the AMD Opteron and the Intel Pentium IV, run at a 2 GHz system clock. The comparison is shown in tables 5.11 and 5.12.

Word size [bit]   Xc2v3k   Xc2v4k   Xc2v6k   Xc2v8k   Blum    Orup   AMD64   P4
      512          0.629    0.776    0.935    1.088   0.216   1.6    0.441   0.129
     1024          0.204    0.294    0.339    0.426   0.100   -      0.214   0.046
     2048          0.043    0.076    0.119    0.152   -       -      0.069   0.014

Table 5.11: Throughput results with average exponent (all values in Mb/s)

Word size [bit]   Xc2v3k   Xc2v4k   Xc2v6k   Xc2v8k   Blum    AMD64   P4
      512         19.150   20.792   22.453   23.941   1.463   4.170   1.580
     1024         14.412   18.525   19.495   22.042   1.365   3.447   0.944
     2048          6.569   11.665   16.049   18.971   -       2.296   0.504

Table 5.12: Throughput results with F4 exponent (2^16 + 1), all values in Mb/s

With the XILINX Xc2v3000 it is possible to achieve the same results as with an AMD Opteron operating in 64-bit mode. Compared to the Intel Pentium IV, the throughput of the Xc2v3000 is at least three times higher. The same implementation is about two to three times faster than Blum's implementation. When using the F4 exponent, the implementation on the Xc2v3000 is four to five times faster than the Opteron and about twelve times faster than the Intel Pentium IV.


Chapter 6 Conclusion
6.1 Summary

Today's FPGAs are ready for RSA implementations operating on the full word size! With Orup's optimizations of the Montgomery algorithm it is possible to implement RSA with common word sizes like 1024 bits, 1536 bits, and 2048 bits. Because the radix can be changed in steps of $2^k$ (with $k = 1, 2, 3, \ldots$), the utilization of the target device can be adapted as desired. The results obtained in this work beat a software implementation on the AMD Opteron. The results also show that a proper interface design and a well-implemented device driver are needed to provide the same throughput on the application layer as on the hardware layer. Another major benefit is that the RSA accelerator card can be integrated into an existing project without changing the code base: due to the JNI interface, it is possible to access the card from Java like a conventional subroutine call.

6.2 Future Work

Booth's recoding turned out to be incompatible with Orup's optimizations when using radices higher than two. Continued research might therefore bring up methods to use Booth's recoding in conjunction with Orup's optimizations. It is difficult to say whether this would increase the performance, because Booth's recoding requires handling signed numbers. Orup and Kornerup invented a parallel algorithm for modular exponentiation [OK91]. With this algorithm it is possible to increase the throughput by a factor of 1.5, but additional hardware is required to implement it. To speed up the implementation, pipelining could be applied at suitable positions in the circuit. Again, it is difficult to predict whether this would increase performance. In particular, it would be interesting to see whether pipelining the quotient determination improves performance.

The development of new FPGA features could also speed up the whole implementation. Full-custom or semi-custom designs might likewise be suitable for the proposed full word-size RSA architecture.


Bibliography
[Ash02] Peter J. Ashenden. The Designer's Guide to VHDL. Morgan Kaufmann Publishers, 2nd edition, 2002.

[BD00] D. Boneh and G. Durfee. Cryptanalysis of RSA with private key d less than N^0.292. IEEE Transactions on Information Theory, 46(4), 2000.

[BFRV92] Stephen D. Brown, Robert J. Francis, Jonathan Rose, and Zvonko G. Vranesic. Field-Programmable Gate Arrays. Kluwer Academic Publishers, 1992.

[BP99] Thomas Blum and Christof Paar. Montgomery modular exponentiation on reconfigurable hardware. 14th Symposium on Computer Arithmetic, April 1999.

[Buc03] Johannes Buchmann. Einführung in die Kryptographie. Springer, 3rd edition, May 2003.

[Bus96] Frank Buschmann. Pattern-Oriented Software Architecture, volume 1. Wiley, 1996.

[ces04] CESYS Spartan 2 evaluation board documentation. http://www.cesys.com/index.php?language=de&doc=advanced&docparams=XC2S_EVAL&menuparams=55, 2004.

[DH76] Whitfield Diffie and Martin E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 1976.

[EW93] Stephen E. Eldridge and Colin D. Walter. Hardware implementation of Montgomery's modular multiplication algorithm. IEEE Transactions on Computers, 42, July 1993.

[iai05] The IAIK Java Cryptography Extension. http://jce.iaik.tugraz.at/products/01_jce/index.php, 2005.

[jce04a] How to implement a provider for the Java Cryptography Extension. http://java.sun.com/j2se/1.4.2/docs/guide/security/jce/HowToImplAJCEProvider.html, 2004.

[jce04b] Java Cryptography Extension (JCE) reference guide. http://java.sun.com/j2se/1.4.2/docs/guide/security/jce/JCERefGuide.html, 2004.

[LFB+00] Peter Lipp, Johannes Farmer, Dieter Bratko, Wolfgang Platzer, and Andreas Sterbenz. Sicherheit und Kryptographie in Java: Einführung, Anwendungen und Lösungen. Addison-Wesley, July 2000.

[LV99] A. K. Lenstra and E. R. Verheul. Selecting cryptographic key sizes. October 1999.

[LW99] Jey-Jong Leu and An-Yeu Wu. A scalable low-complexity digit-serial VLSI architecture for RSA cryptosystem. SiPS, 1999.

[LW00] Jey-Jong Leu and An-Yeu Wu. Design methodology for Booth-encoded Montgomery module. ISCAS, 2000.

[Mon85] P. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44, April 1985.

[Mur94] Raymond Murphy. English Grammar in Use: A reference and practice book for intermediate students. Cambridge University Press, 12th edition, 1994.

[OK91] Holger Orup and Peter Kornerup. A high-radix hardware algorithm for calculating the exponential M^E modulo N. 10th IEEE Symposium on Computer Arithmetic, 1991.

[Oru95] Holger Orup. Simplifying quotient determination in high-radix modular multiplication. 12th Symposium on Computer Arithmetic, 1995.

[Par00] Behrooz Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, 2000.

[RSA78] Ronald L. Rivest, Adi Shamir, and Leonard Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120-126, February 1978.

[RSA79] Ronald L. Rivest, Adi Shamir, and Leonard Adleman. On digital signatures and public key cryptosystems. Technical report, MIT Laboratory for Computer Science, January 1979.

[rsa04] Public-key cryptographic standards. http://www.rsasecurity.com, 2004.

[Sch96] Bruce Schneier. Applied Cryptography: Protocols, Algorithms and Source Code in C. John Wiley & Sons, Inc., 2nd edition, 1996.

[Ste04] Beth Stearns. Java Native Interface. http://java.sun.com/docs/books/tutorial/native1.1/index.html, 2004.

[Sti95] Douglas R. Stinson. Cryptography: Theory and Practice. CRC Press LLC, 1995.

[SV93] M. Shand and J. Vuillemin. Fast implementations of RSA cryptography. Proc. 11th Symposium on Computer Arithmetic, 1993.

[tom04] Apache Jakarta Tomcat. http://jakarta.apache.org/tomcat, 2004.

[Wal64] C. S. Wallace. A suggestion for a fast multiplier. IEEE Transactions on Electronic Computers, 1964.

[Wal91] Colin D. Walter. Fast modular multiplication using 2-power radix. International Journal of Computer Mathematics, 39, July 1991.

[xil04] Xilinx support. http://www.xilinx.com, 2004.

[xil05] Xilinx Spartan-II 2.5V FPGA complete data sheet. http://direct.xilinx.com/bvdocs/publications/ds001.pdf, 2005.

[Yal00] Sudhakar Yalamanchili. VHDL: From Simulation to Synthesis. http://users.ece.gatech.edu/~sudha/book/synthesis, 2000.

[YCC98] C. C. Yang, T. S. Chang, and C. W. Chen. A new RSA cryptosystem hardware design based on Montgomery's algorithm. IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, 45(7), July 1998.

