
GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology

(ICIET) - 2016 | July 2016

e-ISSN: 2455-5703

Pipelined VLSI Architecture for RSA Based on Montgomery Modular Multiplication
Vinodhini N. (1), Suganya C. (2)
(1,2) Research Scholar, Department of Electronics and Communication Engineering
(1,2) Dr. Mahalingam College of Engineering and Technology, Pollachi-642002, India

Abstract
Modular multiplication is a key operation in many public-key cryptosystems. Montgomery multiplication is one of the well-known algorithms for performing modular multiplication quickly. Carry-save adders (CSAs) are employed to avoid carry propagation at each addition. To reduce the extra clock cycles, a configurable carry-save adder (CCSA), which operates as either one full adder or two half adders, can be employed. In addition, a mechanism has been developed to skip unnecessary carry-save additions in the one-level CCSA while maintaining a short critical path delay. In the proposed architecture, the path with the maximum worst-case delay is identified, and additional buffers are inserted in that path, synchronized to the clock, to reduce the worst-case delay. This introduces pipelining, which increases the speed and achieves a high throughput. The pipelined architecture is applied to the RSA public-key algorithm to increase the throughput of the RSA cryptosystem.
Keywords: Carry-save addition, Montgomery modular multiplier, Pipelining, RSA
__________________________________________________________________________________________________

I. INTRODUCTION
With the growth of data communication and internet services such as electronic commerce, security plays an important role across inter-networks. The public-key cryptosystem of Rivest, R.L. et al. provides data security for such networks. In these cryptosystems, modular multiplication (MM) is a central arithmetic operation, and MM with large integers is preferred to enhance security. Montgomery multiplication, proposed by Montgomery, P.L., is one of the fast algorithms for MM. The algorithm determines the quotient from the least significant digit of the operands only and replaces the complicated division with a series of shift and modular addition operations. Montgomery MM computes $S = A \cdot B \cdot R^{-1} \pmod{N}$, where N is the k-bit modulus, $R^{-1}$ is the inverse of R modulo N, that is, $R \cdot R^{-1} \equiv 1 \pmod{N}$, and $R = 2^k \bmod N$. It can therefore be implemented easily in VLSI circuits to speed up encryption and decryption. Long carry propagation is a major problem when adding large operands in binary representation. To solve this problem, several approaches based on carry-save addition have been proposed to achieve a significant speedup of Montgomery MM. These approaches can be divided into semi-carry-save (SCS) and full-carry-save (FCS) strategies.
In the semi-carry-save format used by Kim, Y.S. et al., Bunimov, V. et al. and Zhengbing, H. et al., the inputs and outputs of the Montgomery multiplication are represented in binary form, but the intermediate results of the modular multiplication are kept in carry-save format to avoid carry propagation. However, the final product must be converted from its carry-save representation to binary at the end of each modular multiplication. This conversion can be accomplished simply by adding the carry and sum terms of the carry-save representation. That addition still suffers from long carry propagation, however, and extra circuitry and time are likely to be needed for the conversion.
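The carry-save idea described above can be illustrated with a minimal Python sketch. The function names `carry_save_add` and `accumulate` are hypothetical and are not taken from the cited works; Python integers stand in for the k-bit sum and carry words. The sketch only shows that the intermediate additions need no carry propagation, while the final conversion is an ordinary addition of the sum and carry words.

```python
def carry_save_add(x, y, z):
    """3-to-2 carry-save addition (illustrative helper, not from the paper).

    Returns (sum_word, carry_word) with sum_word + carry_word == x + y + z.
    Each bit position is handled independently, so no carry ripples.
    """
    sum_word = x ^ y ^ z                               # bitwise full-adder sum
    carry_word = ((x & y) | (x & z) | (y & z)) << 1    # bitwise full-adder carry
    return sum_word, carry_word


def accumulate(operands):
    """Keep a running total in carry-save form (ss, sc)."""
    ss, sc = 0, 0
    for op in operands:
        ss, sc = carry_save_add(ss, sc, op)
    return ss, sc


ss, sc = accumulate([13, 7, 22, 5])
# Format conversion: the only step that needs a carry-propagate addition.
total = ss + sc
print(ss, sc, total)
```

In an SCS design this final `ss + sc` corresponds to the format conversion performed once per modular multiplication.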
In the full-carry-save format used by Walter, C.D. and Zhengbing, H. et al., all inputs and outputs of the Montgomery modular multiplication are kept in carry-save form, except in the final step that produces the result of the modular exponentiation. This means, however, that the number of operands in the modular multiplication increases, so additional registers are required to store them. FCS-based Montgomery multipliers are therefore likely to have higher hardware complexity and a longer critical path than SCS-based multipliers.
A. Montgomery Multiplication
Modular multiplication of two integers A and B simply computes
$S = A \cdot B \bmod N$
Given an integer $a < N$, where N is the k-bit modulus, A is said to be its N-residue with respect to r if
$A = a \cdot r \pmod{N}$
where $r = 2^k$.
Likewise, given an integer $b < N$, B is said to be its N-residue with respect to r if
$B = b \cdot r \pmod{N}$

All rights reserved by www.grdjournals.com

(GRDJE / CONFERENCE / ICIET - 2016 / 076)

The Montgomery product of A and B can then be defined as
$S = A \cdot B \cdot r^{-1} \pmod{N}$
where $r^{-1}$ is the inverse of r modulo N.
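A short worked derivation may help show why this definition is useful: the Montgomery product of two residues is itself the residue of the ordinary product. The numerical values below (N = 13, k = 4, a = 5, b = 7) are chosen purely for illustration and do not come from the paper.

```latex
\begin{aligned}
S &= A \cdot B \cdot r^{-1} \equiv (a r)(b r)\, r^{-1} \equiv (a b)\, r \pmod{N} \\[4pt]
\text{Example: } & N = 13,\quad r = 2^4 = 16,\quad r^{-1} = 9 \ \ (16 \cdot 9 = 144 \equiv 1 \pmod{13}) \\
& a = 5,\ b = 7:\quad A = 80 \bmod 13 = 2,\quad B = 112 \bmod 13 = 8 \\
& S = 2 \cdot 8 \cdot 9 \bmod 13 = 144 \bmod 13 = 1 \\
& (a b)\, r \bmod 13 = 560 \bmod 13 = 1
\end{aligned}
```

Both routes give the same value, so a chain of multiplications can stay in the residue domain and be converted back only once at the end.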
The radix-2 version of Montgomery's multiplication algorithm is shown in Fig. 1.

Fig. 1: MM algorithm
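As a rough software illustration of the radix-2 algorithm referred to above, the following sketch uses the textbook bit-serial formulation with a final conditional subtraction. It is an assumption that this matches the listing in Fig. 1, which is not reproduced here; the function name `montgomery_multiply` is hypothetical.

```python
def montgomery_multiply(a, b, n, k):
    """Textbook radix-2 Montgomery multiplication (illustrative sketch).

    Returns a * b * 2^(-k) mod n, assuming n is odd, n < 2^k and 0 <= a, b < n.
    """
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1                  # i-th bit of the multiplier
        q_i = (s + a_i * b) & 1             # quotient bit: makes the sum even
        s = (s + a_i * b + q_i * n) >> 1    # exact division by 2
    if s >= n:                              # s < 2n holds here, so one subtraction suffices
        s -= n
    return s


n, k = 13, 4                                # toy modulus, r = 2^4 = 16
a, b = 5, 7
A = (a << k) % n                            # N-residue of a
B = (b << k) % n                            # N-residue of b
S = montgomery_multiply(A, B, n, k)         # residue of a*b
product = montgomery_multiply(S, 1, n, k)   # multiply by 1 to leave the residue domain
print(S, product)
```

The quotient bit depends only on the least significant bit of the running sum, which is what allows the division by N to be replaced by shifts and additions.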

B. RSA
RSA is one of the most widely used public-key algorithms at present.
The RSA encryption and decryption functions are given by
$C = M^e \bmod n$
$D = C^d \bmod n$
respectively, where M is a plaintext message block, C is a ciphertext block, D is the decrypted block, n is the k-bit modulus, and e and d are the public and private exponents, respectively. The relation $ed \equiv 1 \pmod{(p-1)(q-1)}$ must also hold, where p and q are two large prime numbers and $n = pq$. Thus, an RSA operation is a modular exponentiation with operands satisfying the conditions stated above. RSA requires repeated modular multiplications to compute the modular exponentiation, and the modulus is generally at least 1024 bits long for long-term security.
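The relations above can be checked on a toy example. The primes and exponent below are small illustrative values, far too small for real security, and are not taken from the paper; Python's built-in `pow` is used for the modular exponentiation and the modular inverse (Python 3.8 or later).

```python
# Toy RSA parameters (illustrative only; real moduli are at least 1024 bits).
p, q = 61, 53
n = p * q                       # modulus
phi = (p - 1) * (q - 1)
e = 17                          # public exponent, coprime to phi
d = pow(e, -1, phi)             # private exponent: e*d = 1 (mod phi)

M = 65                          # plaintext block, must satisfy M < n
C = pow(M, e, n)                # encryption: C = M^e mod n
D = pow(C, d, n)                # decryption: D = C^d mod n

print(n, d, C, D)
```

Each call to `pow` with a modulus stands for the long sequence of modular multiplications that a hardware multiplier would execute.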
This paper aims to enhance the performance of the CSA-based SCS Montgomery multiplier while reducing its delay through pipelining. The proposed method is implemented in the RSA public-key algorithm to increase the speed and throughput of RSA cryptosystems.

II. EXISTING MONTGOMERY MULTIPLICATION


Several SCS-based and FCS-based Montgomery multipliers have been proposed. Among these, the new SCS-based Montgomery MM has the minimum delay and achieves higher throughput than the other existing multipliers described in [10].
The new SCS-based Montgomery MM algorithm aims to reduce both the critical path delay and the number of clock cycles needed to complete one modular multiplication.
A. Architecture
On the basis of the critical path delay reduction, clock cycle reduction, and quotient pre-computation described by Kuang, S.R. et al., a new SCS-based Montgomery MM algorithm (the SCS-MM-New algorithm shown in Fig. 4) using the one-level CCSA architecture shown in Fig. 2 was proposed to significantly reduce the clock cycles required to complete one MM. The CCSA architecture can operate as either one full adder or two half adders to perform carry-save additions. A select signal decides whether it operates as a full adder or as half adders. When the select signal is 1, the full adder is selected.
The following equations [10] are used to derive the new SCS algorithm:
(1)
qi+1 = ( SS[i]1SC[i]1) (SS[i]0 SC[i]0)
(2)
2 ) (SS[i]1 SC[i]1)
qi+2 = ( SS[i]2SC[i]2) (qi
skipi+1= ~(Ai+1 (SS[i]1SC[i]1) (SS[i]0SC[i]0))

(3)


Fig. 2: Proposed CCSA circuit.

Fig. 3: SCS-based MM-New architecture.

As shown in the SCS-MM-New algorithm, steps 1-5, which produce the operands used in the loop, are performed first. Note that because $q_{i+1}$ and $q_{i+2}$ must be generated in the ith iteration, the iteration index i of Montgomery MM starts from -1 instead of 0, and the corresponding initial values must be set to 0. Furthermore, the original for loop is replaced with a while loop in the SCS-MM-New algorithm so that unnecessary iterations are skipped when $skip_{i+1} = 1$. In addition, the final iteration count in the SCS-MM-New algorithm is changed to k + 4 instead of k + 1.
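The skipping idea can be illustrated on the plain radix-2 loop. The sketch below is a simplification, not the SCS-MM-New algorithm: it assumes that an iteration needs no addition when both the multiplier bit and the quotient bit are 0, in which case the update reduces to a right shift. It does not model the quotient pre-computation, the carry-save representation, or the merging of two shifts into one cycle. The function name `montgomery_count_skips` is hypothetical.

```python
def montgomery_count_skips(a, b, n, k):
    """Radix-2 Montgomery multiplication that counts shift-only iterations.

    Returns (result, additions, shift_only), assuming odd n < 2^k and
    0 <= a, b < n. Illustrative only; not the SCS-MM-New algorithm.
    """
    s = 0
    additions = 0
    shift_only = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q_i = (s + a_i * b) & 1
        if a_i == 0 and q_i == 0:
            s >>= 1                            # nothing to add: shift only
            shift_only += 1
        else:
            s = (s + a_i * b + q_i * n) >> 1   # carry-save addition in hardware
            additions += 1
    if s >= n:
        s -= n
    return s, additions, shift_only


print(montgomery_count_skips(2, 8, 13, 4))
```

In this toy run three of the four iterations need only a shift, which is the kind of work a skip mechanism avoids spending a full clock cycle on.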
The hardware architecture of the SCS-MM-New algorithm, denoted the SCS-MM-New multiplier, is shown in Fig. 3. It consists of one one-level CCSA architecture, two 4-to-1 multiplexers (M1 and M2), one simplified multiplier SM3, one skip detector Skip_D, one zero detector Zero_D, and six registers. Skip_D is developed to generate $skip_{i+1}$ and the two signals selected by it in the ith iteration. Both M4 and M5 are 3-bit 2-to-1 multiplexers and are much smaller than the k-bit multiplexers M1, M2, and SM3. In addition, the area of Skip_D is negligible compared with that of the k-bit one-level CCSA architecture. The select signals of multiplexers M1 and M2 are generated by the control part, which is not depicted for the sake of simplicity.
At the beginning of Montgomery multiplication, the flip-flops (FFs) that store $skip_{i+1}$ and the two associated signals are first reset to 0, as shown in step 1 of Fig. 4, so that the initial addition can be computed via the one-level CCSA architecture. During the while loop, the skip detector Skip_D, shown in Fig. 5, produces $skip_{i+1}$ and these two signals. Skip_D is composed of four XOR gates, one NOR gate, and two 2-to-1 multiplexers. It first generates the $q_{i+1}$, $q_{i+2}$, and $skip_{i+1}$ signals in the ith iteration according to equations (1), (2), and (3), and then selects the correct pair of outputs according to $skip_{i+1}$. At the end of the ith iteration, these outputs and $skip_{i+1}$ must be stored in the FFs.


B. SCS-based MM-New algorithm

Fig. 4. SCS-MM-New algorithm.

In the clock cycle following the ith iteration, SM3 outputs the proper x according to the signals generated in the ith iteration, as shown in steps 9-12, and M1 and M2 output the correct SC and SS according to $skip_{i+1}$ generated in the ith iteration. If $skip_{i+1} = 0$, SC >> 1 and SS >> 1 are selected. Otherwise, SC >> 2 and SS >> 2 are selected, so that the 1-bit right-shift operations in steps 13 and 17 of the SCS-MM-New algorithm are performed together in the cycle following iteration i.

Fig. 5: Skip Detector.


In addition, M4 and M5 also select and output the correct $SC[i]_{2:0}$ and $SS[i]_{2:0}$ according to $skip_{i+1}$ generated in the ith iteration.
Note that $SC[i]_{2:0}$ and $SS[i]_{2:0}$ can also be obtained from M1 and M2, but with a longer delay because these are 4-to-1 multiplexers. After the while loop in steps 7-24 is completed, the values stored in the FFs are reset to 0. Then the format conversion in steps 26 and 27 can be performed by the SCS-MM-New multiplier in the same way as the computation in steps 3 and 4.
Finally, SS [k + 5] equals to 0.
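One way a carry-save adder can perform the format conversion itself is to apply half-adder steps repeatedly until the carry word becomes zero. The sketch below assumes this scheme; the presence of a zero detector in the architecture suggests something similar, but the exact steps 26 and 27 are not reproduced in this text, so this is an illustration rather than the paper's procedure. The name `carry_save_to_binary` is hypothetical.

```python
def carry_save_to_binary(ss, sc):
    """Convert a carry-save pair (ss, sc) to binary using half-adder steps.

    Each pass computes a bitwise half-adder sum and carry. The loop stops
    when the carry word is zero (the role a zero detector would play).
    Returns (binary_value, number_of_passes). Illustrative assumption only.
    """
    passes = 0
    while sc != 0:
        ss, sc = ss ^ sc, (ss & sc) << 1   # half adders on every bit position
        passes += 1
    return ss, passes


print(carry_save_to_binary(4, 14))
```

The number of passes depends on the data, which is why a design that performs two half-adder levels per cycle could finish the conversion in fewer cycles.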

III. PROPOSED ARCHITECTURE


In the existing SCS-MM-New architecture (shown in Fig. 3), every path in the multiplier is analysed at the register-transfer level (RTL). Through simulation and the RTL view, the path with the maximum worst-case delay is identified. Additional buffers such as registers or flip-flops are inserted in that path and synchronized to the clock to reduce the worst-case delay. This introduces pipelining and thereby increases efficiency. Because the operating frequency is inversely proportional to the critical path delay, shortening the critical path raises the achievable clock rate. The design is then optimized for area, power, and speed. The proposed multiplier, shown in Fig. 6, is therefore faster and has a lower delay than the existing SCS-MM-New multiplier.
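The effect of inserting a register into the worst-case path can be summarised with a simple timing model. The symbols $T_1$, $T_2$ (delays of the two halves of the original critical path) and $T_{\text{reg}}$ (register setup and clock-to-output overhead) are introduced here for illustration and do not appear in the paper.

```latex
f_{\max} = \frac{1}{T_{\text{crit}}}, \qquad
T_{\text{crit}}^{\text{unpipelined}} = T_1 + T_2, \qquad
T_{\text{crit}}^{\text{pipelined}} \approx \max(T_1, T_2) + T_{\text{reg}}
```

Under this model the clock period shrinks only if the register overhead is smaller than the shorter of the two segments, and the added register costs area and one extra cycle of latency.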

Fig. 6: Pipelined Montgomery Multiplier

The proposed pipelined architecture is implemented in the Rivest-Shamir-Adleman (RSA) algorithm, one of the public-key cryptosystems. Modular exponentiation is the main operation in RSA and is carried out through repeated modular multiplication. The modular multiplications in the RSA exponentiation are therefore performed by the proposed pipelined Montgomery modular multiplier to increase the speed and throughput of RSA cryptosystems. The simulation result is shown in Fig. 7.
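How repeated Montgomery multiplications build up a modular exponentiation can be sketched in software as follows. This uses the standard left-to-right square-and-multiply method, which is an assumption here since the paper does not state which exponentiation scheme is used; the function names are hypothetical and the multiplier is the textbook radix-2 version rather than the pipelined hardware design.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k and 0 <= a, b < n."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q_i = (s + a_i * b) & 1
        s = (s + a_i * b + q_i * n) >> 1
    return s - n if s >= n else s


def montgomery_pow(m, e, n):
    """Compute m^e mod n (odd n, e >= 1) using only Montgomery multiplications."""
    k = n.bit_length()                          # ensures n < 2^k
    r2 = (1 << (2 * k)) % n                     # r^2 mod n, precomputed once per modulus
    x = montgomery_multiply(m % n, r2, n, k)    # m * r mod n (enter residue domain)
    acc = montgomery_multiply(1, r2, n, k)      # r mod n, the residue of 1
    for bit in bin(e)[2:]:                      # scan exponent bits, MSB first
        acc = montgomery_multiply(acc, acc, n, k)      # square
        if bit == "1":
            acc = montgomery_multiply(acc, x, n, k)    # multiply
    return montgomery_multiply(acc, 1, n, k)    # leave residue domain


# Toy RSA round trip (illustrative parameters, not from the paper).
n, e, d = 3233, 17, 2753
C = montgomery_pow(65, e, n)
D = montgomery_pow(C, d, n)
print(C, D)
```

For a k-bit exponent this needs roughly k squarings plus one multiplication per set bit, so the speed of a single Montgomery multiplication dominates RSA throughput.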


Fig. 7: RSA simulation result

IV. EXPERIMENTAL RESULTS


The design is coded in VHDL and synthesized and simulated using Xilinx ISE 14.2. The worst-case delay path is identified from the synthesis report, and a register is inserted in that path to reduce the critical path delay. It is difficult to compare the proposed multiplier directly with previous designs because they use different technologies. The delay, area, and power of the existing and the proposed pipelined architectures are therefore compared for two key sizes, 1024 and 2048 bits. The results are given in Table 1.
Key size   Multiplier   Delay (ns)   Area*   Power (W)
1024       Existing     5.379        33      3.237
1024       Proposed     5.136        36      3.171
2048       Existing     7.381        38      3.171
2048       Proposed     7.138        43      3.122

Table 1: Comparison of the existing and pipelined Montgomery multipliers for 1024-bit and 2048-bit key sizes.
* Number of slices occupied
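The relative changes implied by Table 1 can be worked out with a few lines of Python. The figures are copied from the table; the conversion of delay to a maximum clock frequency assumes that the reported delay is the limiting clock period, which the paper does not state explicitly. The names `results` and `percent_change` are hypothetical.

```python
# (delay in ns, area in slices, power in W), copied from Table 1
results = {
    1024: {"existing": (5.379, 33, 3.237), "proposed": (5.136, 36, 3.171)},
    2048: {"existing": (7.381, 38, 3.171), "proposed": (7.138, 43, 3.122)},
}


def percent_change(old, new):
    """Signed relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


for key_size, rows in results.items():
    old_delay, old_area, old_power = rows["existing"]
    new_delay, new_area, new_power = rows["proposed"]
    f_old = 1000.0 / old_delay    # MHz, assuming delay equals the clock period
    f_new = 1000.0 / new_delay
    print(
        f"{key_size}-bit: delay {percent_change(old_delay, new_delay):+.2f}%, "
        f"area {percent_change(old_area, new_area):+.2f}%, "
        f"power {percent_change(old_power, new_power):+.2f}%, "
        f"f_max {f_old:.1f} -> {f_new:.1f} MHz"
    )
```

The calculation makes the trade-off explicit: the delay and power fall by a few percent, while the slice count rises because of the added pipeline register.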

V. CONCLUSION
The SCS-MM-New multiplier has the shortest critical path delay and needs the fewest clock cycles to complete one Montgomery MM among the existing designs, and thus has the lowest execution time and the highest throughput. This paper presented a pipelined Montgomery modular multiplier architecture that reduces the delay and power further. When the proposed multiplier architecture is used in present-day RSA algorithms, their computational speed increases, which is a major advantage for such cryptosystems.

REFERENCES
[1] Bunimov, V., Schimmler, M. and Tolg, B. (2002). A complexity-effective version of Montgomery's algorithm. Proc. Workshop Complexity Effective Designs.
[2] Kim, Y.S., Kang, W.S. and Choi, J.R. (2000). Asynchronous implementation of 1024-bit modular processor for RSA cryptosystem. Proc. 2nd IEEE Asia-Pacific Conf. ASIC, pp. 187-190.
[3] Kuang, S.R., Wang, J.P., Chang, K.C. and Hsu, H.W. (2013). Energy-efficient high-throughput Montgomery modular multipliers for RSA cryptosystems. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Vol. 21, no. 11, pp. 1999-2009.
[4] Kuang, S.R., Wu, K.Y. and Lu, R.Y. (2015). Low-cost high-performance VLSI architecture for Montgomery modular multiplication. IEEE Trans. Very Large Scale Integr. (VLSI) Syst., Volume: PP, Issue: 99.


[5] McIvor, C., McLoone, M. and McCanny, J.V. (2004). Modified Montgomery modular multiplication and RSA exponentiation techniques. IEE Proc.-Comput. Digit. Techn., Vol. 151, no. 6, pp. 402-408.
[6] Montgomery, P.L. (1985). Modular multiplication without trial division. Math. Comput., Vol. 44, no. 170, pp. 519-521.
[7] Rivest, R.L., Shamir, A. and Adleman, L. (1978). A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, Vol. 21, no. 2, pp. 120-126.
[8] Walter, C.D. (1999). Montgomery exponentiation needs no final subtractions. Electron. Lett., Vol. 35, no. 21, pp. 1831-1832.
[9] Zhengbing, H., Al Shboul, R.M. and Shirochin, V.P. (2007). An efficient architecture of 1024-bits cryptoprocessor for RSA cryptosystem based on modified Montgomery's algorithm. Proc. 4th IEEE Int. Workshop Intell. Data Acquisition Adv. Comput. Syst., pp. 643-646.
