
Towards Efficient Execution of Erasure Codes on

Multicore Architectures
Roman Wyrzykowski, Lukasz Kuczynski, and Marcin Wozniak
Institute of Computer and Information Sciences,
Czestochowa University of Technology,
Dabrowskiego 73, 42-201 Czestochowa, Poland
{roman,lkucz,marcell}@icis.pcz.pl

Abstract. Erasure codes can improve the availability of distributed storage in comparison with replication systems. In this paper, we focus on investigating how to map systematically the Reed-Solomon and Cauchy Reed-Solomon erasure codes onto the Cell/B.E. and GPU multicore architectures. A method for the systematic mapping of the computation kernels of encoding/decoding algorithms onto the Cell/B.E. architecture is proposed. This method takes into account properties of the architecture on all three levels of its parallel processing hierarchy. The performance results are shown to be very promising. The possibility of using GPUs is studied as well, based on the Cauchy version of Reed-Solomon codes.

Keywords: Erasure codes, Reed-Solomon codes, Cauchy Reed-Solomon codes, multicore architectures, Cell/B.E., GPU

1 Introduction

There is a rapid increase in sensitive data, such as biomedical records or financial data. Protecting such data while in transit, as well as while at rest, is crucial [6]. An example is provided by distributed data storage systems in grids [18], which have different security concerns than traditional file systems. Rather than being concentrated in one place, data are now spread across multiple hosts. Failure of a single host, or an adversary taking control of a host, could lead to loss of sensitive data and compromise the whole system. Consequently, suitable techniques, e.g. cryptographic algorithms and data replication, should be applied to fulfill such key requirements as confidentiality, integrity, and availability [18, 19].
A classical concept of building fault-tolerant systems consists in replicating data on several servers. Erasure codes can improve the availability of distributed storage by splitting up the data into n blocks, encoding them redundantly into m additional blocks, and distributing the blocks over various servers [2]. As was shown in [15], the use of erasure codes increases the mean time to failure by many orders of magnitude compared to replication systems with similar storage and bandwidth requirements.


There are many ways of generating erasure codes. A standard way is the use of the Reed-Solomon (or RS) codes [10]. The main disadvantage of this approach is a large computational cost, because all operations, including multiplications, are performed using Galois field GF(2^w) arithmetic, which is not natively supported by modern microprocessors; here w must satisfy 2^w ≥ n + m. In this context, an interesting alternative is provided by the Digital Fountain codes or, more generally, Low-Density Parity-Check (LDPC) codes [7]. Their implementation can be reduced to a series of bitwise XOR operations. However, this potential advantage of LDPC codes is not always realized in practice [13], where relatively small values of n are often used. In particular, it was shown that for the encoding ratio r = n/(n+m) = 1/2, the performance of RS codes is not worse than that of LDPC codes if n ≤ 50. This relationship depends on the ratio between the performance of the network and the performance of the processing units used for encoding/decoding. For a constant network performance, increasing the performance of processing units gives advantage to the RS codes.
The last conclusion is especially important nowadays, when multicore architectures are emerging in every area of computing [16]. Furthermore, an important step towards improving the performance of RS codes has been made recently, when a Cauchy version of these codes was proposed [14]. In particular, this new class of codes (CRS codes, for short) does not require performing any multiplication in Galois field arithmetic; a series of bitwise XOR operations is executed instead.
In this work, we focus on investigating how to systematically map the RS and CRS erasure codes onto the Cell/B.E. architecture [1]. This innovative heterogeneous multicore chip is significantly different from conventional multiprocessor or multicore architectures. The Cell/B.E. integrates nine processor elements (cores) of two types: the Power processor element (PPE) is optimized for control tasks, while the eight synergistic processor elements (SPEs) provide an execution environment optimized for data-intensive processing. Each SPE supports vector processing on 128-bit words, implemented in parallel by two pipelines. Each SPE includes 128 vector registers, as well as a private local store for fast instruction and data access. The Element Interconnect Bus (EIB) provides a high-performance communication subsystem connecting all the cores. Also, the Cell/B.E. offers an advanced, hardware-based security architecture [19]. The impressive computational power of the Cell/B.E., coupled with its security features, makes it a suitable platform for implementing algorithms aimed at improving data confidentiality, integrity, and availability [18, 19].
In the last part of this paper, we study the possibility of using another very promising type of multicore architecture, namely GPUs (Graphics Processing Units) [5, 17]. Basic features of GPUs include the utilization of a large number of relatively simple processing units operating in a SIMD fashion, as well as hardware-supported, advanced multithreading. For example, the Nvidia Tesla C1060 is equipped with 240 cores, delivering a peak performance of 0.93 TFLOPS. A tremendous step towards a wider acceptance of GPUs in general-purpose computations was the development of software environments which made it possible to program GPUs in high-level languages. The new software developments, such as Nvidia CUDA [8] and OpenCL [9], allow programmers to implement algorithms on existing and future GPUs much more easily.

2 Reed-Solomon Codes and Linear Algebra Algorithms

More precisely, an erasure code works in the following way. A file F of size |F| is partitioned into n blocks (stripes) of B words each, where B = |F|/n. Each block is stored on one of n data devices D_0, D_1, ..., D_{n-1}. Additionally, there are m checksum devices C_0, C_1, ..., C_{m-1}. Their contents are derived from the contents of the data devices using a special encoding algorithm. This algorithm has to allow for restoring the original file from any n (or a bit more) of the n + m storage devices D_0, D_1, ..., D_{n-1}, C_0, C_1, ..., C_{m-1}, even if m of these devices have failed, in the worst case.
The application of the RS erasure codes includes [10, 11] two stages: (i) encoding, and (ii) decoding. At the encoding stage, an input data vector d_n = [d_0, d_1, ..., d_{n-1}]^T, containing n words of w bits each, is multiplied by a special matrix

    F_{(n+m)×n} = [ I_{n×n} ]
                  [ F_{m×n} ] .                                 (1)

Its first n rows correspond to the identity matrix, while the whole matrix is derived as a result of transforming an (n+m) × n Vandermonde matrix with elements defined over the Galois field GF(2^w).
The result of the encoding is an (n+m)-element column vector

    e_{n+m} = F_{(n+m)×n} d_n = [ d_n ]
                                [ c_m ] ,                       (2)

where

    c_m = F_{m×n} d_n .                                         (3)

Therefore, the encoding stage can be reduced to performing the matrix-vector multiplication (3) many times, where all operations are carried out over GF(2^w).
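For illustration, a minimal C sketch of this basic operation is given below: it computes the checksum part c_m = F_{m×n} d_n of Eqn. (3) for one data vector over GF(2^8). The routine name and the assumed helper gf256_mul (one possible table-based implementation is discussed in Section 3.1) are illustrative only, not part of the original formulation.

    #include <stdint.h>

    /* Assumed helper: multiplication in GF(2^8); see the table-based
     * sketch accompanying Eqn. (8) in Section 3.1. */
    uint8_t gf256_mul(uint8_t a, uint8_t b);

    /* Eqn. (3): encode one data vector d[0..n-1] into m checksum words
     * c[0..m-1], i.e. c = F * d over GF(2^8). F is m x n, stored row-major. */
    void rs_encode_vector(const uint8_t *F, int m, int n,
                          const uint8_t *d, uint8_t *c)
    {
        for (int i = 0; i < m; i++) {
            uint8_t acc = 0;
            for (int j = 0; j < n; j++)
                acc ^= gf256_mul(F[i * n + j], d[j]);   /* addition = XOR */
            c[i] = acc;
        }
    }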
At the decoding stage, the following expression is used to reconstruct failed data from non-failed data and checksum devices:

    d_n = (F'_{n×n})^{-1} e_n ,                                 (4)

where the inverse matrix (F'_{n×n})^{-1} is computed from those n rows of the matrix F_{(n+m)×n} that correspond to non-failed data and checksum devices.

3 Mapping Reed-Solomon Erasure Codes and Their Cauchy Version onto Cell/B.E. Architecture

3.1 Mapping Reed-Solomon Codes

In our investigation, we focus on mapping the following expression:

    C_{m×B} = F_{m×n} D_{n×B} ,                                 (5)


which is obtained from Eqn. (3) by taking into consideration the necessity to process not a single vector d_n, but B such vectors. An expression of the same kind is used at the decoding stage. Moreover, in this work we neglect the influence of computing the inverse matrix (F'_{n×n})^{-1} on the performance of the whole algorithm. The cost of this operation can be neglected for relatively small values of m, which are of our primary interest in the case of distributed data storage in grids [18].
In this work, we propose a method for the systematic mapping of Eqn. (5)
onto the Cell/B.E. architecture. This method takes into account properties of
the architecture on all three levels of its parallel processing hierarchy, namely:
1. eight SPE cores running independently, and communicating via the EIB bus;
2. vector (SIMD) processing of 16 bytes in each SPE core;
3. executing instructions by two pipelines (odd and even) in parallel.
For this aim, Eqn. (5) is decomposed into a set of matrix-matrix multiplications:

    C_{m×16} = F_{m×n} D_{n×16} .                               (6)

To compute each of these multiplications within a corresponding SPE core using its SIMD parallel capabilities, the following vectorization algorithm is proposed (a C sketch of this scheme follows the notation list below):

    for i = 0, 1, ..., m-1 do {
        c_i = [0, 0, ..., 0]
        for j = 0, 1, ..., n-1 do
            c_i := c_i ⊕ (f_{i,j} ⊗ d_j)                        (7)
    }

where:
- the vector f_{i,j} is obtained by copying the element (byte) f_{i,j} of the matrix F_{m×n} onto all 16 elements (bytes) of an SPE vector register;
- ⊗ is the element-by-element multiplication of two vectors, implemented over GF(2^8);
- ⊕ denotes the bitwise XOR operation.
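As a scalar illustration only, the following C sketch mirrors algorithm (7) for one 16-byte-wide block, i.e. one multiplication (6); on a real SPE the loop body is expressed with 128-bit SIMD instructions (splatting f_{i,j} over a vector register, vector XOR, and the shufb-based multiplication described next). The helper gf256_mul is again an assumption of this sketch.

    #include <stdint.h>
    #include <string.h>

    uint8_t gf256_mul(uint8_t a, uint8_t b);   /* assumed GF(2^8) multiply */

    /* One block of Eqn. (6): C (m x 16) = F (m x n) * D (n x 16) over GF(2^8).
     * Row i of C plays the role of c_i, row j of D the role of d_j in (7). */
    void rs_encode_block16(const uint8_t *F, int m, int n,
                           const uint8_t D[][16], uint8_t C[][16])
    {
        for (int i = 0; i < m; i++) {
            memset(C[i], 0, 16);                   /* c_i = [0, ..., 0]          */
            for (int j = 0; j < n; j++) {
                uint8_t f = F[i * n + j];          /* byte splat over 16 lanes   */
                for (int b = 0; b < 16; b++)       /* (x) per byte, then (+)=XOR */
                    C[i][b] ^= gf256_mul(f, D[j][b]);
            }
        }
    }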

Furthermore, to execute this algorithm efficiently on an SPE core, the multiplication operation of the form c = f · d is implemented using table lookups [11], based on the following formula:

    c = gfilog( gflog(f) + gflog(d) ) .                         (8)

Here gflog and gfilog denote, respectively, logarithms and antilogarithms defined over GF(2^w). Their values are stored in two tables, whose length does not exceed 256 bytes. Following our previous work [19], the efficient implementation of the table lookups required by Eqn. (8) is based on the utilization of the shufb permutation instruction, which performs 16 simultaneous byte table lookups in a 32-entry table. Larger tables are addressed using a binary-tree process over a series of 32-entry table lookups, where successive bits of the table indices are used as a selector (using the selb instruction) to choose the correct sub-table value.
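A portable (non-SIMD) sketch of this table-lookup multiplication is given below, assuming w = 8 and one common choice of primitive polynomial for GF(2^8), x^8 + x^4 + x^3 + x^2 + 1 (0x11d); the names gf256_init and gf256_mul are ours. On the SPE, the two lookups are instead carried out with shufb/selb as described above.

    #include <stdint.h>

    /* Log/antilog tables for GF(2^8); each table is at most 256 bytes. */
    static uint8_t gflog[256];
    static uint8_t gfilog[255];

    void gf256_init(void)
    {
        unsigned b = 1;
        for (int i = 0; i < 255; i++) {
            gfilog[i] = (uint8_t)b;          /* antilog: 2^i                 */
            gflog[b]  = (uint8_t)i;          /* log:     log_2(b) = i        */
            b <<= 1;
            if (b & 0x100) b ^= 0x11d;       /* reduce modulo the polynomial */
        }
    }

    /* Eqn. (8): c = gfilog( gflog(f) + gflog(d) ), with zero operands
     * handled separately since log(0) is undefined. */
    uint8_t gf256_mul(uint8_t f, uint8_t d)
    {
        if (f == 0 || d == 0) return 0;
        return gfilog[(gflog[f] + gflog[d]) % 255];
    }

With such a helper, the inner operation of algorithm (7) reduces to two lookups, one addition modulo 255, and one XOR per byte.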


Fig. 1. An example of encoding using the Cauchy binary matrix (based on [14])

3.2 Mapping Cauchy Reed-Solomon Codes

In the case of the Cauchy version of RS codes, the matrix F_{m×n} is transformed into a wm × wn binary matrix. An example of such a Cauchy matrix is shown in Fig. 1, for GF(2^3), n = 5, m = 2. As a result, any multiplication over the Galois field is reduced to a series of bitwise XOR operations. For example, the following expression is used to compute the checksum c_{1,2} from Fig. 1:

    c_{1,2} = d_{0,0} ⊕ d_{1,2} ⊕ d_{2,1} ⊕ d_{2,2} ⊕ d_{3,0} ⊕ d_{3,2} ⊕ d_{4,0} ⊕ d_{4,1} ,    (9)

where d_{j,l} denotes the l-th package of the j-th data device, l = 0, 1, ..., w-1.
The mapping method proposed in Subsection 3.1 can be applied in this case as well, provided that the properties of binary Cauchy matrices are taken into account. In particular, the vectorization algorithm takes the following form (a C sketch of the dense-format loop nest is given after the list below):

    for i = 0, 1, ..., m-1 do {
        for k = 0, 1, ..., w-1 do {
            c_{i,k} = [0, 0, ..., 0]
            for j = 0, 1, ..., n-1 do {
                for l = 0, 1, ..., w-1 do
                    c_{i,k} := c_{i,k} ⊕ (f_{i,k,j,l} · d_{j,l})    (10)
            }
        }
    }

where:
- the coefficients f_{i,k,j,l} are equal to 1 or 0;
- depending on f_{i,k,j,l}, the innermost loop operation reduces to an XOR operation with either the vector d_{j,l} or [0, 0, ..., 0].


One of the most important conclusions from this algorithm is the necessity to consider a sparse format for representing the Cauchy matrix, besides the dense format. At the price of some additional overhead, the use of a sparse format [16] allows us to avoid the operations associated with zero coefficients f_{i,k,j,l}.
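The sketch below shows one possible CSR-like storage of the binary Cauchy matrix, keeping only the column indices of the ones in each row (the structure and field names are illustrative). The encoding loop then visits only the data packages that actually contribute to a given checksum package, at the price of indirect accesses through row_ptr and cols.

    #include <stdint.h>
    #include <string.h>

    /* CSR-like storage of a binary matrix: for row r, the column indices of
     * its ones are cols[row_ptr[r] .. row_ptr[r+1]-1]; values are implicitly 1. */
    typedef struct {
        int rows;        /* = w*m for the binary Cauchy matrix */
        int *row_ptr;    /* length rows + 1                    */
        int *cols;       /* length = number of ones (nnz)      */
    } binary_csr_t;

    /* Sparse-format CRS encoding of one 16-byte-wide block. */
    void crs_encode_block16_sparse(const binary_csr_t *Fb,
                                   const uint8_t D[][16], uint8_t C[][16])
    {
        for (int r = 0; r < Fb->rows; r++) {         /* r enumerates (i, k)   */
            memset(C[r], 0, 16);
            for (int p = Fb->row_ptr[r]; p < Fb->row_ptr[r + 1]; p++) {
                const uint8_t *d = D[Fb->cols[p]];   /* package d_{j,l}       */
                for (int b = 0; b < 16; b++)
                    C[r][b] ^= d[b];
            }
        }
    }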

4 Performance Results on Cell/B.E. Processor

4.1 Using Reed-Solomon Codes

In Table 1, we present the performance results achieved for three different implementations of the encoding procedure (7). The pair of values n = m = 4 was applied as one of the most promising options to be used for our distributed data storage system in the ClusteriX grid [18]. The table shows the number L_C of clock cycles needed by a single SPE core to process either one (L_B = 1) or ten (L_B = 10) data packages, each of size n × 16 bytes. The variants correspond to different manual optimizations of the program code.
Based on Table 1, we can estimate the maximum bandwidth for encoding data on all 8 SPEs as:

    b^RS_8 = (8 · 3.2 · 10^9 · L_B · n · 16) / L_C = 9.58 GB/s ,    (11)

where 3.2 · 10^9 is the SPE clock rate in cycles per second, and the best measured value L_C = 1710 (variant 2, L_B = 10) is substituted.

Such a high value of the bandwidth b^RS_8 means that in real circumstances it is no longer a constraint on the performance of the whole system. For example, in the above-mentioned ClusteriX grid this performance is constrained by the bandwidth of 2 × 10 Gb/s available in the wide-area network PIONIER, which is used to connect local clusters.
4.2 Using Cauchy Reed-Solomon Codes

Using the open-source Jerasure library [12] for n = m = 4, w = 3, we generate the Cauchy matrix F_{wm×wn} = F_{12×12}, which contains 88 nonzero elements among all 144 elements. For the Cauchy version of RS codes, Eqn. (11) takes the following form:

    b^CRS_8 = (8 · 3.2 · 10^9 · L_B · n · w · 16) / L_C  [GB/s] .    (12)

Table 1. Performance results (number L_C of clock cycles) for different variants of implementing Reed-Solomon encoding

    Compiler       variant 1              variant 2              variant 3
    option     L_B = 1  L_B = 10       L_B = 1  L_B = 10       L_B = 1  L_B = 10
    O1           223      2118           201      2078           214      2211
    O2           215      1990           198      1710           215      1770
    O3           215      1990           198      1710           215      1770


Then, by substituting here the values of L_C achieved experimentally for the dense and sparse formats of representing the Cauchy matrix, we obtain the following estimates, respectively:

    b^CRS_{8,D} = 13.65 GB/s ;
    b^CRS_{8,S} = 62.2 GB/s .
These results confirm that the sparse format allows for achieving a much higher performance than the dense one. However, the current implementation for the sparse format is not flexible. Further investigation is required in order to combine the flexibility of the program code with high performance.
Keeping in mind the achieved value b^RS_8 of the bandwidth for the classic RS codes, we can also conclude that, for the experimental setting considered in this section, it does not make sense to use the Cauchy version of RS codes instead of the classic one. Also, in practice the estimated value of b^CRS_{8,S} is constrained by the maximum bandwidth of access to the main memory of the Cell/B.E. processor, which is equal to 25.6 GB/s. However, the rationale for utilization of the CRS erasure codes could still apply when considering multicore architectures other than the Cell/B.E. processor, or other distributed storage systems characterized by different values of the parameters m and n than those studied in this section.

5 Implementing CRS Codes on Nvidia Tesla C1060 GPU

The CUDA programming environment [8] makes it possible to develop parallel applications for both the Windows and Linux operating systems, giving access to a well-designed programming interface in the C language. On a single GPU, it is possible to run several CUDA and graphics applications concurrently. However, the utilization of GPUs in everyday practice is still limited. The main reason is the necessity of adapting implemented applications and algorithms to a target architecture, in order to match its internal characteristics. This paper deals with the problem of how to perform such an adaptation efficiently for the encoding stage in the Cauchy version of Reed-Solomon codes.
The CUDA software architecture includes two modules, dedicated respectively to a general-purpose CPU and a graphics processor. This allows for utilization of the GPU as an application accelerator, where a part of the application is executed on a standard processor, while another part is assigned to the GPU as a so-called kernel. The allocation of GPU memory, data transfers, and kernel execution are initiated by the CPU. Each data item used on the GPU needs to be copied from the main memory to the GPU memory; each such transfer is a source of latency which affects the resulting performance negatively [4]. These performance overheads can be reduced in CUDA using the stream processing mechanism. It allows for overlapping kernel computations with data transfers between the main memory and the GPU memory using the asynchronous CUDA API, which returns from CUDA calls immediately, before their completion.
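As an illustration of this mechanism (a sketch under our own assumptions, not the actual implementation), the fragment below overlaps host-to-device copies with kernel execution using two CUDA streams; encode_kernel is a placeholder name, and h_data is assumed to be allocated as page-locked memory.

    #include <cuda_runtime.h>

    __global__ void encode_kernel(const unsigned int *d, unsigned int *c);  /* placeholder */

    /* Process a file as a sequence of chunks, overlapping the copy of chunk
     * i+1 with the encoding of chunk i. Error checking omitted for brevity. */
    void encode_chunks(const unsigned int *h_data, size_t chunk_words,
                       int num_chunks, unsigned int *d_buf[2],
                       unsigned int *d_checksums, int num_blocks)
    {
        cudaStream_t stream[2];
        for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);

        for (int chunk = 0; chunk < num_chunks; chunk++) {
            int s = chunk & 1;
            /* Asynchronous copy from pinned host memory into the device buffer. */
            cudaMemcpyAsync(d_buf[s], h_data + chunk * chunk_words,
                            chunk_words * sizeof(unsigned int),
                            cudaMemcpyHostToDevice, stream[s]);
            /* Issued into the same stream: starts after its own copy completes,
             * but may overlap with the copy running in the other stream. */
            encode_kernel<<<num_blocks, 512, 0, stream[s]>>>(d_buf[s], d_checksums);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < 2; s++) cudaStreamDestroy(stream[s]);
    }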
Another key feature of modern GPUs is their hierarchical memory organization, which includes several levels with different volume and access time. First of all, GPUs are equipped with a Global Memory accessible by all threads (read and write). However, access to this relatively large memory is rather expensive. Other types of GPU memory, accessible to all the threads running on a graphics card, are Constant Memory and Texture Memory. Their access time is shorter, but threads are only allowed to read from these memories. Threads within a particular CUDA block share a fast Shared Memory, which is used for communication and synchronization among the threads of a block. Finally, after being initialized, each thread obtains access to a pool of registers.
5.1 Mapping CRS Codes onto GPU Architecture

The issue of how to implement erasure codes on GPUs using the Reed-Solomon approach was investigated in [3, 4]. The necessity to perform expensive multiplications over the Galois field GF(2^w) limits the performance achieved by such an approach. Therefore, we decided to investigate the possibility of using the Cauchy version of Reed-Solomon codes on GPUs. For this aim, we have implemented a modified version of the encoding algorithm (10), which is shown below:

    for j = 0, 1, ..., n-1 do {
        for l = 0, 1, ..., w-1 do {
            // GPU kernel
            for i = 0, 1, ..., m-1 do
                for k = 0, 1, ..., w-1 do
                    c_{i,k} := c_{i,k} ⊕ (f_{i,k,j,l} · d_{j,l})    (13)
        }
    }
where d_{j,l} and c_{i,k} are data and checksum vectors, respectively; each of them consists of L_E = |F|/(4 · n · w) elements of type int. The total number L_T of created threads should be greater than or equal to L_E. For example, when encoding a file of size 192 MB with n = m = 4, w = 3, this constraint gives L_T = 192 · 1024 · 1024/(4 · 4 · 3) = 201326592/48 = 4194304 threads. Assuming the maximum number of threads within a single block (512 threads), we should create 8192 blocks of threads.
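As a quick check of this arithmetic, a few lines of C reproduce the element and block counts for the example above (512 threads per block being the limit assumed in the text):

    #include <stdio.h>

    int main(void)
    {
        long long file_bytes = 192LL * 1024 * 1024;        /* 192 MB          */
        int n = 4, w = 3, threads_per_block = 512;
        long long LE = file_bytes / (4LL * n * w);         /* ints per vector */
        long long blocks = (LE + threads_per_block - 1) / threads_per_block;
        printf("L_E = %lld elements, %lld blocks of %d threads\n",
               LE, blocks, threads_per_block);             /* 4194304, 8192   */
        return 0;
    }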
The proposed modification of the encoding algorithm allows us to utilize the stream processing mechanism, where the transfer of a certain data stream (vector d_{j,l}) is performed in each step. After copying the vector d_{j,l} to the GPU memory, the GPU kernel is invoked. Each of the GPU threads created in this way is responsible for the execution of XOR operations for a single element of the vector d_{j,l} and the corresponding elements of the checksum vectors c_{i,k}. The resulting distribution of computation among threads, as well as the organization of data in the GPU memory, allows us to optimize access to the available Global Memory, since consecutive threads access data in contiguous areas of memory.


Moreover, each thread fetches the vector d_{j,l} only once, utilizing it m·w times for computations.
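A possible CUDA kernel realizing the inner (i, k) loops of (13) for one incoming vector d_{j,l} is sketched below. The flat layout of the checksums (m·w contiguous vectors of L_E ints) and passing the relevant column of the binary matrix as a plain device array are our simplifying assumptions; in the actual implementation the Cauchy matrix is kept in texture memory, as described next.

    #include <cuda_runtime.h>

    /* One launch handles one data vector d_{j,l} (LE ints). Thread t updates
     * element t of every checksum vector c_{i,k} whose coefficient f_{i,k,j,l}
     * equals 1. f_col holds that column of the binary Cauchy matrix (m*w
     * bytes); checksums are m*w contiguous vectors of LE ints each. */
    __global__ void crs_encode_kernel(const unsigned int *d_jl,
                                      unsigned int *checksums,
                                      const unsigned char *f_col,
                                      int mw, int LE)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= LE) return;
        unsigned int v = d_jl[t];             /* fetched once, reused m*w times */
        for (int r = 0; r < mw; r++)
            if (f_col[r])
                checksums[r * LE + t] ^= v;   /* consecutive threads touch contiguous memory */
    }

For the 192 MB example above, such a kernel would be launched n·w = 12 times (once per vector d_{j,l}), each time with 8192 blocks of 512 threads.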
The Cauchy matrix F is small and constant; it is located in the Texture
Memory in order to speed up fetching elements of the matrix by GPU threads. In
this work, both the dense and sparse formats of representing the Cauchy binary
matrix were implemented. In particular, to represent the sparsity structure of
the Cauchy matrix, the standard Compressed Sparse Row (CSR) format [16, 17]
was used.

5.2 Performance Results on Nvidia Tesla C1060

The performance experiments were carried out on a platform containing a Tesla C1060 GPU and an AMD Phenom II X4 3.12 GHz CPU, with CUDA 2.2 as the software environment. In this platform, the GPU and CPU are coupled through the PCIe x16 bus (version 2.0), which provides a maximum bandwidth of 8 GB/s. The experimental results are presented in Tables 2 and 3 for two sets of parameters: (i) n = m = 4, w = 3, and (ii) n = 8, m = 4, w = 4, respectively.
These tables show the real bandwidth of data encoding on a CPU accelerated by a graphics processor (GPU+CPU bandwidth). When measuring this bandwidth, we take into account the following phases: (i) memory allocation and data copying from the main memory to the GPU memory, and (ii) encoding on the GPU. The phase of transferring results back to the main memory is not considered, because this phase can be overlapped with the transfer of data from CPU to GPU for the next file. Also, we measure the bandwidth achieved when the general-purpose CPU is used alone (last column), as well as the performance achieved by the GPU kernel alone (without any interaction with the CPU).
Table 2. Bandwidth achieved for GPUs and CPUs when encoding files of different sizes (n = m = 4, w = 3)

    File    Number of        dense format              sparse format          CPU
    size    CUDA blocks  GPU kernel   GPU+CPU      GPU kernel   GPU+CPU     bandwidth
    [MB]                 [MB/s]       [MB/s]       [MB/s]       [MB/s]      [MB/s]
    0.05          2           74          57            79          64        241
    0.09          4          155         126           158         128        117
    0.19          8          293         238           314         249         47
    0.38         16          561         407           563         409         48
    0.75         32          847         609           827         598         49
    1.5          64         1139         899          1141         902         48
    3           128         1466        1251          1456        1248         48
    6           256         1504        1375          1507        1377         45
    12          512         2047        1911          2038        1906         44
    24         1024         2097        2021          2080        2002         47
    96         4096         2136        2110          2127        2103         44
    384       16384         2144        2132          2142        2131         47


Table 3. Bandwidth achieved for GPUs and CPUs when encoding files of different sizes (n = 8, m = 4, w = 4)

    File    Number of        dense format              sparse format          CPU
    size    CUDA blocks  GPU kernel   GPU+CPU      GPU kernel   GPU+CPU     bandwidth
    [MB]                 [MB/s]       [MB/s]       [MB/s]       [MB/s]      [MB/s]
    0.06          2           42          38            39          36         42
    0.13          4           83          75            81          74         36
    0.25          8          166         149           161         145         18
    0.5          16          286         249           292         249         16
    1            32          437         379           441         381          6
    2            64          580         525           580         527          6
    4           128          740         689           740         690          5
    8           256          754         725           755         727          5
    16          512         1107        1075          1106        1075          5
    32         1024         1124        1106          1125        1107          5
    128        4096         1165        1158          1164        1158          5
    512       16384         1168        1162          1168        1164          5

In general, the advantage of using a GPU as an accelerator over a solely CPU-based implementation is reduced by the overhead caused by data transfers between the CPU and the GPU. The results of the experiments confirm that this overhead is compensated by the GPU's parallel processing capabilities even for relatively short files, with sizes of several hundred kilobytes. For files containing several megabytes, the accelerated environment processes data more than ten times faster than the CPU, for n = m = 4. When encoding large files, up to several hundred megabytes, it becomes possible to achieve more than 2.1 GB/s of bandwidth. For n = 8, m = 4, the advantage of the accelerated platform over the general-purpose CPU is even larger.
Another conclusion is the similar efficiency of the dense and the sparse formats of representing the Cauchy matrix F. The use of the sparse representation reduces the amount of computation. However, this reduction is balanced by the additional overhead related to the indirect addressing of elements of the checksum vectors c_{i,k}. This effect is not surprising, since the degree of sparsity of the Cauchy matrices is relatively low. For the matrices F corresponding to Tables 2 and 3, less than 50% of all the elements of these matrices (41.9% and 48.2%, respectively) are zeros.

6 Conclusions

Erasure codes can radically improve the availability of distributed storage in comparison with replication systems. In order to realize this potential, efficient implementations of the most compute-intensive parts of the underlying algorithms should be developed. The investigation carried out in this work confirms the advantage of using modern multicore architectures for the efficient implementation of the classic Reed-Solomon erasure codes, as well as their Cauchy modification.


References
1. Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell Broadband Engine Architecture and its first implementation: A performance view. IBM Journal of Research and Development 51(5), 559–572 (2007)
2. Collins, R., Plank, J.: Assessing the Performance of Erasure Codes in the Wide-Area. In: Proc. 2005 Int. Conf. on Dependable Systems and Networks (DSN'05), pp. 182–187. IEEE Computer Society (2005)
3. Curry, M.L., Skjellum, A., Ward, H.L., Brightwell, R.: Arbitrary Dimension Reed-Solomon Coding and Decoding for Extended RAID on GPUs. In: Proc. 3rd Petascale Data Storage Workshop (PDSW'08) (2008)
4. Curry, M.L., Skjellum, A., Ward, H.L., Brightwell, R.: Accelerating Reed-Solomon Coding in RAID Systems with GPUs. In: IPDPS 2008, pp. 1–6. IEEE Press, http://www.bibsonomy.org/bibtex/2e7f39d74179b4d96fea4d89df77c5d6b/dblp (2008)
5. Fatahalian, K., Houston, M.: GPUs: A Closer Look. Comm. ACM 51, 50–57 (2008)
6. Kher, V., Kim, Y.: Securing Distributed Storage: Challenges, Techniques, and Systems. In: ACM Workshop on Storage Security and Survivability, pp. 9–25 (2005)
7. MacKay, D.J.C.: Fountain Codes. IEE Proc. Communications 152(6), 1062–1068 (2005)
8. Nickolls, J., Buck, I., Garland, M., Skadron, K.: Scalable Parallel Programming with CUDA. Queue 6(2), 40–53 (2008)
9. OpenCL - The open standard for parallel programming of heterogeneous systems, http://www.khronos.org/opencl
10. Plank, J.: A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems. Software - Practice & Experience 27(9), 995–1012 (1997)
11. Plank, J., Ding, Y.: Note: Correction to the 1997 Tutorial on Reed-Solomon Coding. Software - Practice & Experience 35(2), 189–194 (2005)
12. Plank, J., Simmerman, S., Schuman, C.: Jerasure: A Library in C/C++ Facilitating Erasure Coding for Storage Applications, https://www.cs.utk.edu/~plank/plank/papers/CS-08-627.pdf
13. Plank, J., Thomason, M.: A Practical Analysis of Low-Density Parity-Check Erasure Codes for Wide-Area Storage Applications. In: Proc. 2004 Int. Conf. on Dependable Systems and Networks, pp. 115–124. IEEE Computer Society (2004)
14. Plank, J., Xu, L.: Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Network Storage Applications. In: NCA-06: 5th IEEE Int. Symp. on Network Computing and Applications, pp. 173–180 (2006)
15. Weatherspoon, H., Kubiatowicz, J.: Erasure Coding vs. Replication: A Quantitative Comparison. In: Proc. IPTPS'02, pp. 328–338 (2002)
16. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms. Parallel Computing 35, 178–194 (2009)
17. Wozniak, M., Olas, T., Wyrzykowski, R.: Parallel Implementation of Conjugate Gradient Method on Graphics Processors. Lect. Notes in Comp. Sci. 6067, 125–135 (2010)
18. Wyrzykowski, R., Kuczynski, L.: Towards Secure Data Management System for Grid Environment Based on the Cell Broadband Engine. Lect. Notes in Comp. Sci. 4967, 825–834 (2008)
19. Wyrzykowski, R., Kuczynski, L., Rojek, K.: Mapping AES Cryptography and Whirlpool Hashing onto Cell/B.E. Architecture. In: Proc. PARA 2008 (2010) (accepted for publication)
