fangyl@dns.ime.tsinghua.edu.cn
Abstract
In this paper, we propose an efficient hardware-oriented modular multiplication algorithm based on Montgomery's algorithm. We employ the high-radix technique and modify the original Montgomery's algorithm to reduce hardware complexity and improve processing speed. An RSA cryptosystem hardware design based on the proposed algorithm is presented. The design has been implemented as a single-chip 512-bit RSA processor with the CSMC (Central Semiconductor Manufacture Corporation) 0.6 µm CMOS standard cell library. The processor contains about 96k gates and delivers a data rate of 113 kbits/sec with a 40 MHz clock in the worst case.
Introduction
As telecommunication networks have grown explosively and the internet has become increasingly popular, their applications now cover almost every aspect of human life, including important fields such as personal identification and commerce. Network security has therefore become an increasingly serious issue. The fundamental security requirements include confidentiality, authentication, data integrity and nonrepudiation. One efficient solution to the network security issue is public key cryptography [1]. Among the various public key cryptography algorithms, the RSA cryptosystem [2] is one of the most efficient, versatile and widely used today. To encrypt and decrypt, the input text is first encoded to a numeric format and divided into blocks of suitable size. The blocks are then processed as $C = M^E \pmod{N}$ for encryption and $M = C^D \pmod{N}$ for decryption, where M and C are the plaintext and ciphertext blocks, respectively, and N, E, and D are the cryptosystem parameters.
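The block processing described above can be sketched as follows. This is a minimal illustration, not part of the design: the toy parameters N = 3233, E = 17 and D = 2753 and all function names are assumptions chosen for readability, and one byte per block is used only because every byte value is smaller than this small N.

```python
# Toy RSA parameters (hypothetical, far too small for real use):
# N = 61 * 53, and E * D = 1 (mod 60 * 52).
N, E, D = 3233, 17, 2753


def encode_blocks(text):
    """Encode text as bytes; each byte is one numeric block (all < N)."""
    return list(text.encode("utf-8"))


def encrypt(blocks):
    """C = M^E (mod N) for every plaintext block M."""
    return [pow(m, E, N) for m in blocks]


def decrypt(blocks):
    """M = C^D (mod N) for every ciphertext block C."""
    return [pow(c, D, N) for c in blocks]


def decode_blocks(blocks):
    """Turn numeric blocks back into text."""
    return bytes(blocks).decode("utf-8")


cipher = encrypt(encode_blocks("RSA"))
plain = decode_blocks(decrypt(cipher))
```

A real system would use much larger blocks (512 bits in this paper) and a padding scheme; the sketch only shows the two exponentiations.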
Modular exponentiation is the main computation of the RSA cryptosystem, and it can be reduced to a sequence of modular multiplications ($A \cdot B \pmod{N}$). Among the many algorithms for modular multiplication, Montgomery's algorithm [3] has low complexity and high efficiency, which makes it the most popular choice in RSA cryptosystem realizations. The paper is organized as follows. In Section 2, we propose hardware-oriented modular multiplication and exponentiation algorithms based on the high-radix Montgomery's algorithm. Section 3 describes the hardware implementation of the RSA cryptosystem using the proposed algorithm. Finally, we conclude the paper in the last section.
return S ;
0-7803-6677-8/01/$10.00 ©2001 IEEE.
Authorized licensed use limited to: Padre Conceicao College of Engineering. Downloaded on July 24,2010 at 05:59:51 UTC from IEEE Xplore. Restrictions apply.
Here A, B, N and S are all w-word integers and T is a (w+1)-word integer; s1 and s2 represent the sum words, and c1 and c2 represent the carry words. The algorithm in which the Montgomery product is used to compute the modular exponentiation is given below. Let M and C be the plaintext and ciphertext, respectively, with $C = M^E \pmod{N}$, where N is the modulus and E is the exponent. M, C, N and E are all u-bit integers. R has the same definition as in Algorithm 1. Usually we let $R = 2^u$.
Algorithm 3. ModExp(M, E, N) {
  C = 1;
  M = MontPro(M, R^2 mod N, N);
  for i = 0 to u-1 do {
    if (e_i = 1) C = MontPro(M, C, N);
    M = MontPro(M, M, N);
  }
  return C;
}
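A minimal software sketch of Algorithm 3 follows. It assumes $R = 2^u$ and an odd modulus N, and it uses a plain big-integer Montgomery product (the function `mont_pro` here is a textbook stand-in, not the word-serial Algorithm 2). Because M is held in Montgomery form while C is not, each product MontPro(M, C, N) already yields an ordinary residue, so no final conversion is needed.

```python
def mont_pro(a, b, n, u):
    """Montgomery product a * b * R^-1 mod n, with R = 2**u, n odd, a, b < n."""
    r = 1 << u
    n_prime = (-pow(n, -1, r)) % r   # n' = -n^-1 mod R (needs Python 3.8+)
    t = a * b
    m = (t * n_prime) % r            # chosen so that t + m*n is divisible by R
    s = (t + m * n) >> u             # exact division by R
    return s - n if s >= n else s    # single conditional subtraction


def mod_exp(m, e, n, u):
    """Right-to-left binary exponentiation following Algorithm 3."""
    c = 1
    m_bar = mont_pro(m, pow(1 << u, 2, n), n, u)   # M * R mod N
    for i in range(u):
        if (e >> i) & 1:                 # exponent bit e_i
            c = mont_pro(m_bar, c, n, u)
        m_bar = mont_pro(m_bar, m_bar, n, u)
    return c
```

The two products in the loop body do not depend on each other within one iteration, which is the property the hardware exploits by running two datapaths side by side.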
Algorithm 2 (recoverable fragment of the word loop):
  m = (s_0 + t_0) · n_0' (mod r);
  (c2, s2) = s_0 + m · n_0 + t_0;
  for j = 1 to w-1 do {
    (c1, s1) = t_j + a_j · b_i + c1;
    t_j = s1;
    (c2, s2) = s_j + m · n_j + c2;
    s_{j-1} = s2;
  }
  t_w = c1;
  s_{w-1} = c2;
  (c2, s2) = s_0 + t_0;
  c1 = 1;
Hardware Implementation of RSA Cryptosystem
System Architecture
Fig 1 shows the architecture of our RSA cryptosystem design based on Algorithm 2 and Algorithm 3. The RSA cryptosystem works in two modes: programming mode and processing mode. In the programming mode (program=1), the cryptosystem loads RSA operation parameters such as N and E from ports din and e_in. In the processing mode (program=0), the cryptosystem first loads the message data block to be processed while new_msg=1, then performs RSA encryption/decryption on the data block while new_msg=0. When the RSA operation is completed, port state outputs a positive pulse indicating that the result data can be read from port dout.
The RSA cryptosystem is partitioned into four parts: one register file, one modular exponentiation controller, one Montgomery controller and two identical Montgomery datapaths. Because the two Montgomery modular multiplications in each iteration of Algorithm 3 are identical in form and can run simultaneously, we use a single Montgomery controller to control both datapaths, which reduces the hardware complexity.
Module Design
The RSA cryptosystem contains four modules: register file, Montgomery datapath, Montgomery controller and modular exponentiation controller.
(1) Register File. The register file contains registers for operation parameters, constants and variables such as N, M, C, E and ($R^2 \bmod N$) in Algorithm 3. Because the operands are processed word by word in Algorithm 2, we store them in loop-shift registers, which rotate the operands by one word at the end of each corresponding iteration. The shifting feature makes the controller simpler, and the looping feature guarantees that no data is lost. The registers also have external inputs to load initial data. Figure 2 shows the typical architecture of the register.
(2) Montgomery Datapath. The Montgomery datapath contains arithmetic units and registers for the Montgomery modular multiplication, including two multiply-add units for arithmetic operations, some multiplexers to choose the proper operands and some loop-shift registers to store variables such as S and T in Algorithm 2. Each multiply-add unit contains a k x k multiplier and a k + k + 2k adder, where k is the word length. The results of the multiplier and the adder are registered so that the multiply-add operations are pipelined.
Fig 3. Multiply-add Unit
(3) Montgomery Controller. The Montgomery controller controls the Montgomery product computation process. We use two state machines. One is negative edge triggered, switching the states and generating control codes for the sequential logic; the other is positive edge triggered, generating control codes for the combinational logic.
(4) Modular Exponentiation Controller. The modular exponentiation controller controls the shifting of the exponent. It receives the ending signal from the Montgomery controller and shifts the exponent by one bit. It also counts the shifted bits and issues an ending signal for the whole encryption/decryption process when it finishes the last bit of the exponent.
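To make the role of the multiply-add unit concrete, the sketch below builds a word-serial Montgomery product out of a single primitive that multiplies two k-bit words and adds two more k-bit words. This is the standard textbook word-serial form, shown here as an assumption-laden software model and not as the paper's Algorithm 2; the names `mul_add`, `to_words` and `mont_pro_words` are hypothetical. Note that $x \cdot y + a + b \le (2^k-1)^2 + 2(2^k-1) = 2^{2k}-1$, so the result always fits in two words, which is why an adder taking two k-bit inputs and one 2k-bit product suffices.

```python
def mul_add(x, y, a, b, k):
    """Model of one multiply-add step: x*y + a + b on k-bit words.

    Returns (carry word, sum word); the total always fits in 2k bits.
    """
    total = x * y + a + b
    assert total < (1 << (2 * k))
    return total >> k, total & ((1 << k) - 1)


def to_words(value, k, w):
    """Split an integer into w words of k bits, least significant first."""
    return [(value >> (k * j)) & ((1 << k) - 1) for j in range(w)]


def mont_pro_words(A, B, N, k, w):
    """A * B * R^-1 mod N with R = 2**(k*w), N odd, A and B below N."""
    mask = (1 << k) - 1
    n0_inv = (-pow(N & mask, -1, 1 << k)) & mask   # n0' = -n0^-1 mod 2**k
    a, b, n = to_words(A, k, w), to_words(B, k, w), to_words(N, k, w)
    s = [0] * (w + 2)
    for i in range(w):
        # Accumulate the partial product: S = S + A * b_i
        c = 0
        for j in range(w):
            c, s[j] = mul_add(a[j], b[i], s[j], c, k)
        top = s[w] + c
        s[w], s[w + 1] = top & mask, top >> k
        # Pick m so the lowest word of S + m*N becomes zero
        m = (s[0] * n0_inv) & mask
        # S = (S + m*N) / 2**k, done as a one-word shift while adding
        c, _ = mul_add(m, n[0], s[0], 0, k)
        for j in range(1, w):
            c, s[j - 1] = mul_add(m, n[j], s[j], c, k)
        top = s[w] + c
        s[w - 1] = top & mask
        s[w] = s[w + 1] + (top >> k)
    result = sum(s[j] << (k * j) for j in range(w + 1))
    return result - N if result >= N else result
```

In this model the two inner loops run one after the other; a design with two multiply-add units can overlap the partial-product step and the reduction step.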
Performance and Features
To measure the performance of our design, we implemented it as a 512-bit RSA processor with a word length of 32 bits. A larger word length reduces the number of clock cycles needed for the computation, but operations on longer words have a greater delay, which limits the clock frequency. Empirically, we chose 32 bits as the word length to balance these two effects.
We use the fast carry look-ahead model to implement the 32+32+64 adder and the Booth-encoded Wallace-tree model to implement the 32x32 multiplier. When mapped to the CSMC 0.6 µm CMOS standard cell library, the adder shows a 21.33 ns critical path delay and the multiplier shows a 21.73 ns critical path delay. Even if we take 15 percent of the delay as the design margin, the maximum delay is about 25 ns, so the RSA processor can operate under a 40 MHz clock. According to Algorithm 2 and Algorithm 3, the RSA encryption/decryption needs $u(w^2 + 5w + 5)$ clock cycles, where u is the data length and w is the number of words. If k is the word length, then $u = k \cdot w$. For our 512-bit RSA processor, u = 512, k = 32 and w = 16, so it takes about 0.18M clock cycles to complete one RSA encryption/decryption. Table 1 lists the main features of our RSA processor and some other recently presented RSA realizations. With comparable hardware complexity, our design greatly reduces the number of clock cycles by taking advantage of the high-radix technique, so it can operate at a data rate of 113 kbits/sec even under a relatively low clock frequency of 40 MHz.
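The timing figures quoted above can be checked with a few lines of arithmetic. The sketch uses only numbers stated in the text (21.73 ns multiplier delay, 15 percent margin, 40 MHz clock, about 0.18M cycles, 512-bit blocks); the variable names are illustrative assumptions.

```python
# Critical path: the multiplier (21.73 ns) is slower than the adder (21.33 ns).
multiplier_delay_ns = 21.73
design_margin = 0.15
max_delay_ns = multiplier_delay_ns * (1 + design_margin)   # just under 25 ns

clock_hz = 40e6
clock_period_ns = 1e9 / clock_hz                            # 25 ns per cycle

# Throughput from the stated cycle count for one 512-bit operation.
cycles_per_operation = 0.18e6
block_bits = 512
seconds_per_operation = cycles_per_operation / clock_hz
rate_kbits_per_sec = block_bits / seconds_per_operation / 1e3
```

With these inputs the margin-adjusted delay stays within one 25 ns clock period, and one 512-bit operation takes about 4.5 ms, which corresponds to a data rate between 113 and 114 kbits/sec.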
Conclusions
In this paper, we propose a hardware-oriented RSA encryption/decryption algorithm and its VLSI architecture based on the high-radix Montgomery's algorithm. Using the CSMC 0.6 µm CMOS standard cell library, we implemented our design as a 512-bit RSA processor with a 32-bit word length. The processor contains about 96k gates and takes about 0.18M clock cycles to complete a 512-bit RSA encryption/decryption, delivering a data rate of 113 kbits/sec at a clock frequency of 40 MHz in the worst case. It has relatively low hardware complexity and high processing speed. It is also scalable to different numbers of bits in RSA cryptosystems. These features make our design a good candidate for RSA cryptosystem hardware implementation.
References
[1] W. Diffie and M. Hellman, "New Directions in Cryptography," IEEE Transactions on Information Theory, vol. IT-22, pp. 644-654, November 1976.
[2] R. Rivest, A. Shamir and L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Communications of the ACM, vol. 21, pp. 120-126, February 1978.
[3] P. L. Montgomery, "Modular Multiplication without Trial Division," Math. Computation, vol. 44, pp. 519-521, April 1985.
[4] H. Orup, "A 100Kbits/s single chip modular exponentiation processor," in HOT Chips VI, Symp. Rec., pp. 53-59, 1994.
[5] S. Ishii, K. Ohyama, and K. Yamanaka, "A single-chip RSA processor implemented in 0.5 µm rule gate array," in Proc. 7th Annu. IEEE Int. ASIC Conf. Exhibit, pp. 433-436, 1994.
[6] P. S. Chen, S. A. Hwang, and C. W. Wu, "A systolic RSA public key cryptosystem," in Proc. IEEE International Symposium on Circuits and Systems, vol. 4, pp. 408-411, 1996.
[7] Ching-Chao Yang, Tian-Sheuan Chang, and Chein-Wei Jen, "A New RSA Cryptosystem Hardware Design Based on Montgomery's Algorithm," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, vol. 45, no. 7, pp. 908-913, July 1998.