Parallel Algorithm by Rc4

Applied Parallel Algorithm in Information Security
(RC4 Algorithm)
By Majed Ismael
Overview
Symmetric key based algorithms RC4

Type of Parallel Computer
Experimental setup
Design of PASCS
Performance and Scalability Analysis
Symmetric key based algorithms RC4
The RC4 stream cipher was developed by Ron Rivest in 1987 (Schneier, 2008).
The key size of the cipher can be up to 2048 bits (256 bytes). The algorithm is
extremely fast, and because of its speed it is used in many applications.
The algorithm is divided into two sub-algorithms: one for key stream generation
and one for encryption. For encryption, the output of the generator is
XORed with the data stream.
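Because encryption is a byte-wise XOR with the key stream, applying the same key stream a second time restores the original data. The minimal sketch below illustrates only that property. The function name `xor_bytes` and the key stream bytes are hypothetical and are not RC4 output.

```python
def xor_bytes(data: bytes, keystream: bytes) -> bytes:
    """XOR each data byte with the key stream byte at the same position."""
    if len(keystream) < len(data):
        raise ValueError("keystream must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, keystream))


# Hypothetical key stream bytes, used only for illustration (not RC4 output).
keystream = bytes([0x3A, 0x91, 0x5C, 0x07, 0xE2])
plaintext = b"hello"

ciphertext = xor_bytes(plaintext, keystream)
recovered = xor_bytes(ciphertext, keystream)  # the same operation undoes itself
```

This symmetry is why encryption and decryption of a stream cipher cost essentially the same amount of work.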
Description of RC4
RC4 has been one of the most widely used stream ciphers. It has been used in popular
protocols such as Secure Sockets Layer (SSL) to protect web browsing and WEP to protect
wireless networks (Schneier, 2008). Other application areas of RC4 include
Skype and the BitTorrent protocol. RC4 generates a key stream, which is a
pseudo-random stream of bits. The key stream is combined with the plaintext using
bit-wise XOR to produce the ciphertext. The algorithm has two main
parts: the key scheduling algorithm (KSA) and the pseudo-random
generation algorithm (PRGA).
The KSA initializes the permutation in the array S. The "key length" is the
number of bytes in the key and ranges from 1 to 256.
For i = 0 to 255
    S[i] := i
End loop
Set j := 0
For i = 0 to 255
    Set j := (j + S[i] + key[i mod key length]) mod 256
    Swap values of S[i] and S[j]
End loop
Algorithm: The Key Scheduling Algorithm (KSA) (Schneier, 2008)

The PRGA then produces one key stream byte K per iteration. This byte is XORed with one byte of plaintext to convert it into ciphertext.
Set i := 0
Set j := 0
While generating output:
    i := (i + 1) mod 256
    j := (j + S[i]) mod 256
    Swap values of S[i] and S[j]
    K := S[(S[i] + S[j]) mod 256]
    Ciphertext := K XOR Plaintext
    Output Ciphertext
End loop
Algorithm: The Pseudo-Random Generation Algorithm (PRGA) (Schneier, 2008)
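The two algorithms above can be put together as a short sequential sketch. The function names `ksa`, `prga` and `rc4` are chosen here for illustration and do not come from the slides. The code is meant to show the data flow, not to serve as a hardened implementation.

```python
from collections.abc import Iterator


def ksa(key: bytes) -> list[int]:
    """Key scheduling: build the initial permutation S from the key."""
    if not 1 <= len(key) <= 256:
        raise ValueError("key length must be between 1 and 256 bytes")
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s


def prga(s: list[int]) -> Iterator[int]:
    """Pseudo-random generation: yield key stream bytes indefinitely."""
    s = s.copy()  # leave the caller's permutation untouched
    i = j = 0
    while True:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        yield s[(s[i] + s[j]) % 256]


def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt data (the operation is its own inverse)."""
    stream = prga(ksa(key))
    return bytes(b ^ next(stream) for b in data)


ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())  # bbf316e8d940af0ad3
```

The key "Key" with plaintext "Plaintext" is a widely published RC4 test vector. Calling `rc4(b"Key", ciphertext)` returns the original plaintext.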
Type of Parallel Computer
There are different ways to categorize parallel computers. According to Flynn's
Taxonomy (Barney, 2010), multi-processor computer architectures are classified
along two independent dimensions: Instruction Stream and Data Stream. Each of these dimensions can have
only one of two possible states: Single or Multiple. This gives four possible
classifications:
i. Single Instruction, Single Data (SISD):
• A serial computer
• Single Instruction: one instruction stream is being executed by the CPU during
one clock cycle.
• Single Data: one data stream is being used as input data during one clock
cycle.
• Examples: mainframes, minicomputers and workstations.
ii. Single Instruction, Multiple Data (SIMD):
• A form of parallel computer
• Single Instruction: All processing elements execute the same instruction at
any given clock cycle.
• Multiple Data: Each processing element can operate on a different data
element.
• Examples: graphics processing units (GPUs) and the SIMD instruction units
found in current AMD and Intel processors.
iii. Multiple Instructions, Single Data (MISD):
• Multiple Instructions: Each core/processing unit operates on the data
independently, using a separate instruction stream.
• Single Data: A single data stream, in the form of a sequence of bits/bytes, is used as input
by all processing units during one clock cycle.
• Some conceivable uses of this type of system might be:
    • Multiple security algorithms attempting to break a single coded message.
    • Fault-tolerance purposes.
iv. Multiple Instructions, Multiple Data (MIMD):
• Multiple Instructions: Each core/processing unit operates on the data
independently, using a separate instruction stream.
• Multiple Data: Each core works with a different data stream.
• This is currently the most common type of parallel computer; most modern
supercomputers fall into this category.
• Examples: supercomputers, multi-processor SMP computers and multi-core
PCs.
Experimental setup
A multi-core processor is a single computing component (Geer, 2005) that
has two or more independent physical cores. These cores can read and execute
different tasks and instructions concurrently, increasing the overall speed of programs
that are amenable to parallel computing. This architecture also improves
performance and reduces power consumption, which in turn helps mitigate the
greenhouse effect. Fig. 1 shows the architecture of a multi-core
processor.
For this research, the machine was set up with the following configuration:
• Processor - AMD FX(tm)-8320, eight-core processor running at 3500 MHz
• RAM - 8 GB
• System Type - 64-bit operating system
• Operating System - Linux/Ubuntu 12.04
Core-1           Core-2           Core-3           Core-4
private memory   private memory   private memory   private memory
                          shared memory
                          Bus Interface
Fig. 1 Multi-Core Processor Architecture

Design of PASCS
In this framework, a key is supplied to the key stream generator, which produces
a random key stream. The plaintext is divided into fixed-size blocks. The random
key stream is supplied to each individual block so that the plaintext blocks are processed concurrently. The
following figure illustrates the complete process used in this framework.
Fig. 2 Design of Parallel Additive Stream Cipher Structure (PASCS)
As shown in Fig. 2, there are n fixed-size data blocks. A random key stream of the same length
is supplied to each block, and each bit of the plaintext block is XORed with the corresponding key bit to
produce the ciphertext bit. This parallel structure can be used by any synchronous
stream cipher. The size of the block depends on the algorithm's
structure. PASCS is based on the concept of the Vernam cipher, in which
each bit of plaintext has its own key bit. To keep the essence of the Vernam cipher
and maintain its randomness, each block should have a different key stream. Hence, a
modification of the architecture is required to implement the stream cipher algorithms. In this
thesis, PASCS is applied to the RC4 and RC4A algorithms to analyze the impact of
parallelization on the speed of the cipher. We discuss RC4 and RC4A in later chapters.
Procedure: Encryption
Model: Data Parallel Model with P processors [P = 2, 4, 6]
Input: Plaintext in the form of small chunks [Chunk Size = 256], n = number of blocks
Output: Encrypted text
Declare: Plaintext and BlockID as shared variables, i as a variable private to each processing
element
Parallel algorithm PARC4
1. Begin
2. For ALL BlockID in [0, n-1] IN SYNC
3.     Set Start = BlockID × 256 and End = Start + 256
4.     For i = Start to End-1 do
5.         Output[i] = ((keystream_byte XOR BlockID) mod 256) XOR plaintext_byte
6.     End for
7. End
Algorithm: Steps to Implement PARC4

In the algorithm above, the plaintext is declared as a shared variable because each core needs to
access this data in small chunks, and the block size must be
known to each core, as shown in Fig. 3.
Fig. 3 Graphical Representation of Complete Flow and Model Used to Parallelize RC4
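The block-wise procedure can be sketched as follows. This is one reading of the PARC4 pseudocode, not the author's implementation. It assumes that block b covers bytes b × 256 to b × 256 + 255, that a single 256-byte key stream is reused for every block, and that the key stream is varied per block by XORing it with the block ID as in step 5. The names `process_block` and `parc4_sketch` and the stand-in key stream are hypothetical. A thread pool is used only to show the structure. In CPython, threads would not reproduce the reported timings, so real measurements would need a process pool or a compiled implementation.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 256  # chunk size stated in the PARC4 procedure


def process_block(block_id: int, block: bytes, keystream: bytes) -> bytes:
    """Apply step 5 of the procedure to one block.

    Each key stream byte is first XORed with the block ID (mod 256) so that
    different blocks see different key streams, then XORed with the data.
    """
    return bytes(((k ^ block_id) % 256) ^ p for k, p in zip(keystream, block))


def parc4_sketch(data: bytes, keystream: bytes, workers: int = 4) -> bytes:
    """Split data into 256-byte blocks and process them concurrently."""
    if len(keystream) != BLOCK_SIZE:
        raise ValueError("this sketch expects a 256-byte key stream")
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves block order, so the output can be joined directly.
        out = pool.map(process_block, range(len(blocks)), blocks,
                       [keystream] * len(blocks))
        return b"".join(out)


# Stand-in key stream for illustration only; a real run would take 256 bytes
# of RC4 output here.
demo_keystream = bytes((7 * i + 3) % 256 for i in range(BLOCK_SIZE))
demo_plaintext = bytes(range(256)) * 4 + b"tail"

demo_cipher = parc4_sketch(demo_plaintext, demo_keystream)
demo_plain = parc4_sketch(demo_cipher, demo_keystream)  # XOR is self-inverse
```

Because no block depends on the result of another, the blocks can be handed to any number of workers without synchronization beyond joining the outputs in order.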
Experimental Results
To study the performance improvement achieved through parallelization of the
RC4 algorithm, the sequential RC4 algorithm was first
executed to measure its execution time in the given environment. The sequential
results (Table 1) serve as the baseline for comparison with the results of the parallel
algorithm PARC4, shown in Tables 2 to 5.
Table 1: Time (in seconds) taken by RC4 to encrypt/decrypt large data files on a uniprocessor
Size of input data [GB]   Encryption   Decryption   Overall Time
0.1                       1.31785      1.29868      2.61653
0.2                       2.64969      2.62575      5.27544
0.3                       3.87678      3.85207      7.72885
0.4                       5.29639      5.24420      10.54059
0.5                       6.46599      6.41855      12.88454
0.6                       7.73803      7.67800      15.41603
0.7                       9.02553      9.16311      18.18864
0.8                       10.58992     10.93706     21.52698
0.9                       11.59993     11.85782     23.45775
1.0                       12.89147     12.97361     25.86508
Table 2: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files using 2 cores
Size of input data [GB]   Encryption   Decryption   Overall Time
0.1                       0.67892      0.67872      1.35764
0.2                       1.33886      1.33878      2.67764
0.3                       2.0177       2.0176       4.03528
0.4                       2.69656      2.69636      5.39292
0.5                       3.37532      3.37524      6.75056
0.6                       4.05413      4.05407      8.1082
0.7                       4.73296      4.73288      9.46584
0.8                       5.41179      5.41169      10.8235
0.9                       6.09061      6.09051      12.1811
1.0                       6.76942      6.76934      13.5388
Table 3: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files using 4 cores
Size of input data [GB]   Encryption   Decryption   Overall Time
0.1                       0.399695     0.399295     0.79899
0.2                       0.799947     0.799943     1.59989
0.3                       1.199905     1.199903     2.39981
0.4                       1.640765     1.640725     3.28149
0.5                       1.99994      1.9999       3.99984
0.6                       2.362748     2.362742     4.72549
0.7                       2.797925     2.797885     5.59581
0.8                       3.299505     3.299465     6.59897
0.9                       3.649495     3.649455     7.29895
1.0                       3.974905     3.974865     7.94977
Table 4: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files using 6 cores
Size of input data [GB]   Encryption   Decryption   Overall Time
0.1                       0.25999      0.25995      0.51994
0.2                       0.499937     0.499933     0.99987
0.3                       0.769528     0.76952      1.53905
0.4                       1.04601      1.04597      2.09198
0.5                       1.28548      1.28547      2.57095
0.6                       1.537027     1.537023     3.07405
0.7                       1.844995     1.844955     3.68995
0.8                       2.14746      2.14726      4.29472
0.9                       2.33866      2.3386       4.67726
1.0                       2.574919     2.574911     5.14983
Table 5: Time taken by PARC4-I to encrypt/decrypt large data files using 8 cores
Data files [GB]   Encryption time   Decryption time   Overall time
0.1               0.09248           0.09242           0.1849
0.2               0.21614           0.21606           0.4322
0.3               0.34438           0.34432           0.6887
0.4               0.49144           0.49136           0.9828
0.5               0.63823           0.63817           1.2764
0.6               0.87077           0.87073           1.7415
0.7               1.0294            1.0293            2.0587
0.8               1.21104           1.21097           2.422
0.9               1.30337           1.30333           2.6067
1.0               1.45277           1.45273           2.9055
Performance and Scalability Analysis

Speedup
A serial algorithm is typically assessed in terms of its execution time, stated as
a function of its input size. In contrast, the execution time of a parallel algorithm
depends on the input size, the parallel architecture and the number of
processing elements employed. Speedup is defined as the ratio of
the time taken to solve a problem on a single processing element to the time
required to solve the same problem on a parallel computer with p identical
processing cores. From Tables 1 to 5, it can be observed
that the speedup of PARC4 grows with the number of cores used in the
experiments. Fig. 4 shows the speedup comparison for a data file of about 1 GB using
PARC4 on multiple cores.
Fig. 4: Speedup comparison of PARC4 using multiple cores
• This demonstrates the ability of the multi-core processor to handle
large data.
• Complexity of the parallel algorithm: O(n)
• Here n is the number of blocks multiplied by 256,
which equals the n iterations of the serial algorithm. The
total work therefore remains linear, making
PARC4 cost-optimal.
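As a worked illustration of the speedup definition, the sketch below computes speedup from the overall times reported for the 1.0 GB file in Tables 1 to 5. The efficiency figure (speedup divided by the number of cores) is an addition of this sketch and does not appear in the slides. The function and variable names are hypothetical.

```python
# Overall times (seconds) for the 1.0 GB file, copied from Tables 1 to 5.
serial_time = 25.86508
parallel_times = {2: 13.5388, 4: 7.94977, 6: 5.14983, 8: 2.9055}


def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S = T_1 / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, cores: int) -> float:
    """Efficiency E = S / p (an assumption of this sketch; not in the slides)."""
    return speedup(t_serial, t_parallel) / cores


results = {
    p: (speedup(serial_time, t), efficiency(serial_time, t, p))
    for p, t in parallel_times.items()
}

for p, (s, e) in results.items():
    print(f"{p} cores: speedup {s:.2f}, efficiency {e:.2f}")
```

With these figures the speedup comes out at roughly 1.91, 3.25, 5.02 and 8.90 for 2, 4, 6 and 8 cores. The 8-core value exceeds the core count. Table 5 is labeled PARC4-I, so it may come from a different variant than Tables 2 to 4 and may not be directly comparable.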
