
Computer Engineering and Applications, 2011, 47(19)

Comparison and analysis of matrix multiplications on GPU and CPU

LIU Jinfeng, GUO Lei

School of Automation, Northwestern Polytechnical University, Xi'an 710129, China

LIU Jinfeng, GUO Lei. Comparison and analysis of matrix multiplications on GPU and CPU. Computer Engineering and Applications, 2011, 47(19): 9-11.
Abstract: Three matrix multiplications on the CPU and four CUDA-based matrix multiplications on the GPU are described, and the causes of their high performance are analyzed. The common characteristic of the efficient implementations is that data are properly organized and rationally reused, so the cost of memory access is effectively reduced and the speed is greatly improved. The best optimized implementation on the CPU is more than 200 times faster than the naive one; the best optimized implementation on the GPU is about 6 times faster than the best one on the CPU.

Key words: matrix multiplication; Compute Unified Device Architecture (CUDA); Graphic Processing Unit (GPU); storage pattern


DOI: 10.3778/j.issn.1002-8331.2011.19.003

Article ID: 1002-8331(2011)19-0009-03    CLC number: TP301

Biographies: LIU Jinfeng, born in 1971; GUO Lei, born in 1956. E-mail: ljf@sjtu.org
Received 2011-02-28; revised 2011-04-26

1 Introduction

In recent years the GPU has developed much faster than the CPU, and general-purpose computing on the GPU (GPGPU) has become an active research area. NVIDIA's CUDA [2] removes most of the old obstacles to GPGPU programming, so the GPU can now be used as a general parallel processor.

Matrix multiplication is a classic compute-intensive kernel: given two dense N×N matrices A and B, compute C = A*B, with A, B and C stored as one-dimensional row-major arrays [1]. Its regular structure makes it a good vehicle for studying how data organization affects performance.

This paper describes three implementations on the CPU (a naive triple loop, a cache-blocked version, and the Intel MKL library) and four CUDA-based implementations on the GPU, measures them on the same problem, and analyzes why the fast versions are fast.
2 Matrix multiplication on the CPU
2.1 Naive implementation

The naive implementation transcribes the definition directly into three nested loops (Program 1 below). Its running time is dominated not by the floating-point arithmetic but by memory access: because the matrices are stored row by row, the innermost loop walks down a column of B, so consecutive accesses to B are N elements apart and locality is very poor [3].




Program 1  Naive matrix multiplication on the CPU
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        float Ctemp = 0;
        for (k = 0; k < N; k++)
            Ctemp += A[i*N+k] * B[k*N+j];
        C[i*N+j] = Ctemp;
    }
}

Figure 1  Access pattern of Program 1 in the cache (figure not reproduced)

Each access to B[k*N+j] in the inner loop touches a different cache line, so nearly every reference to B misses the cache. For large N the column walk also touches a different memory page on every iteration, so the TLB (Translation Lookaside Buffer), which caches virtual-to-physical page translations, misses as well, and a TLB miss costs far more than an ordinary cache miss.

2.2 Improving locality by blocking

The arithmetic of matrix multiplication is fixed at O(N^3) operations whatever the loop order, but the traversal order of A and B can be rearranged so that a small working set is reused many times while it resides in the cache. Restricting the j and k loops to blocks confines the inner loops to one tile of B at a time; while the tile stays in the cache (and its pages stay in the TLB), every element fetched from memory is reused many times instead of once. Well-tuned libraries such as MKL rely on the same idea.
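A concrete number shows how badly the column walk of Program 1 behaves: with 4 KB pages and N = 1024, one row of B occupies 1024 × 4 B = 4 KB, exactly one page, so walking down a column touches a new page on every single element, and a data TLB with, say, a few hundred entries is overwhelmed. (The values 1024 and 4 KB come from the paper's test setting; the TLB size is our illustrative assumption.)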

2.3 Blocked implementation

Program 2 applies blocking to the j and k loops with tile size BLOCK. For the 1024×1024 single-precision matrices used here a 64×64 tile works well: the tile of B occupies 64*64*4 B = 16 KB, small enough to stay resident in the L1 cache while the i loop sweeps over all rows of A and reuses it [4].

Program 2  Blocked matrix multiplication on the CPU
#define min(a, b) ((a) < (b) ? (a) : (b))
int i, j, k;
/* C must be zeroed before the call, since partial sums are accumulated into it */
for (int jj = 0; jj < N; jj += BLOCK) {      /* BLOCK is the tile size */
    for (int kk = 0; kk < N; kk += BLOCK) {
        for (i = 0; i < N; i++) {
            for (j = jj; j < min(jj+BLOCK, N); j++) {
                float Ctemp = 0;
                for (k = kk; k < min(kk+BLOCK, N); k++)
                    Ctemp += A[i*N+k] * B[k*N+j];
                C[i*N+j] += Ctemp;
            }
        }
    }
}

2.4 Intel MKL

The fastest CPU version simply calls Intel MKL (Math Kernel Library), a mathematical library hand-tuned for Intel processors. MKL organizes its computation around the cache hierarchy and provides, among other things, FFT routines and the complete BLAS (levels 1, 2 and 3). Matrix multiplication is the BLAS level-3 routine sgemm; the experiments use MKL 10.2 through its C interface (Program 3).

Program 3  Matrix multiplication with Intel MKL
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            N, N, N, 1, A, N, B, N, 0, C, N);

The call computes C = 1*A*B + 0*C for row-major N×N matrices with leading dimension N. Measurements in the literature put MKL's sgemm at roughly 60 to 200 times the speed of the naive implementation [3-5].
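For completeness, a minimal self-contained program around Program 3 (our illustrative wrapper; mkl.h is MKL's standard C header):

#include <stdio.h>
#include <stdlib.h>
#include <mkl.h>

int main(void) {
    const int N = 1024;
    float *A = malloc(N * N * sizeof(float));
    float *B = malloc(N * N * sizeof(float));
    float *C = malloc(N * N * sizeof(float));
    for (int i = 0; i < N * N; i++) { A[i] = 1.0f; B[i] = 2.0f; }
    /* C = 1*A*B + 0*C, all matrices row-major, leading dimension N */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0f, A, N, B, N, 0.0f, C, N);
    printf("C[0] = %.1f\n", C[0]);   /* every element is 2*N = 2048 */
    free(A); free(B); free(C);
    return 0;
}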

3 CUDA-based matrix multiplication on the GPU
3.1 The CUDA programming model

CUDA is NVIDIA's architecture for general-purpose computing on the GPU, programmed in CUDA C, a small extension of C [6-7]. A CUDA program has two parts: host code running on the CPU and device code running on the GPU. The host allocates device memory, copies input data from host to device, launches a kernel (a function that the GPU executes in many parallel threads), and copies the results back. Threads are grouped into blocks, and blocks are scheduled onto the GPU's streaming multiprocessors (SM); Figure 2 sketches this organization (figure not reproduced).

The device offers several kinds of memory with very different speeds: per-thread registers and local memory, per-block shared memory, and device-wide global, constant and texture memory. Shared memory is on-chip and fast, while global memory is large but slow, so the central optimization in the kernels below is to stage frequently reused data in shared memory.
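The host side has the same shape for all the kernels in this section; a minimal skeleton (the kernel name matmulKernel and the launch configuration are placeholders, since they depend on the kernel):

float *dA, *dB, *dC;
size_t bytes = N * N * sizeof(float);
cudaMalloc((void**)&dA, bytes);                     // allocate device (GPU) memory
cudaMalloc((void**)&dB, bytes);
cudaMalloc((void**)&dC, bytes);
cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
matmulKernel<<<N, 256>>>(dA, dB, dC, N);            // launch configuration varies per kernel
cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(dA); cudaFree(dB); cudaFree(dC);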

3.2 A simple kernel: one block per row of C

In the first GPU version (Program 4) each thread block computes one row of C. The threads of a block first cooperatively copy the corresponding row of A into shared memory; after a barrier, each thread computes part of the output row, reading B directly from global memory. Every element of the output row needs the whole row of A, so staging it in shared memory cuts the global reads of A from N^3 to N^2, and the total number of global reads falls from 2N^3 to N^3 + N^2.

Program 4  One thread block per row of C
extern __shared__ float data[];
const int tid = threadIdx.x;
const int row = blockIdx.x;
int i, j;
for (i = tid; i < N; i += blockDim.x)
    data[i] = A[row*lda + i];          // copy this block's row of A into shared memory
__syncthreads();
for (j = tid; j < N; j += blockDim.x) {
    float Ctemp = 0;
    for (i = 0; i < N; i++)
        Ctemp += data[i] * B[i*ldb + j];
    C[row*ldc + j] = Ctemp;
}

This version already beats every CPU implementation except MKL, but it reaches only about 1/20 of the GPU's peak: B is still read N^3 times from slow global memory, and those reads dominate the running time.
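Program 4 needs one block per row and enough dynamic shared memory for that row; a plausible launch (the name matmulRow and the 256-thread block size are our assumptions) is:

// third launch parameter sizes the dynamic shared array data[]
matmulRow<<<N, 256, N * sizeof(float)>>>(dA, dB, dC, N);   // lda = ldb = ldc = N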

3.3 A tiled kernel using shared memory

The next kernel (Program 5) stages both input matrices. C is partitioned into K×K sub-blocks; one thread block computes one sub-block, each thread producing a single element. The block loops over the tiles of A and B that contribute to its sub-block: at every step the threads cooperatively load one K×K tile of A and one of B into shared memory, synchronize, accumulate the partial products from the two tiles, and synchronize again before the tiles are overwritten. Each element now comes from global memory only once per tile pass, so the number of global reads drops from 2N^3 to 2N^2*(N/K) = 2N^3/K.

Here K is the compile-time constant BLOCK_SIZE = 16: each block has 16×16 threads, the grid has (N/16)×(N/16) blocks, and the tile loop runs N/16 times.

Program 5  Tiled matrix multiplication in shared memory
float Ctemp = 0;
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
for (int step = 0; step < N/BLOCK_SIZE; step++) {
    // load one tile of A and one tile of B into shared memory
    As[ty][tx] = A[indexA];
    Bs[ty][tx] = B[indexB];
    indexA += BLOCK_SIZE;
    indexB += widthB * BLOCK_SIZE;
    __syncthreads();
    // accumulate the product of the two tiles
    for (i = 0; i < BLOCK_SIZE; i++)
        Ctemp += As[ty][i] * Bs[i][tx];
    // make sure all threads are done before the tiles are overwritten
    __syncthreads();
}
C[indexC] = Ctemp;
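Program 5 leaves the thread indices and starting offsets implicit. A consistent reconstruction for row-major matrices (our assumption, with widthA and widthB the row lengths of A and B) is:

const int tx = threadIdx.x, ty = threadIdx.y;
const int bx = blockIdx.x,  by = blockIdx.y;
int indexA = (by * BLOCK_SIZE + ty) * widthA + tx;                    // moves right across A
int indexB = ty * widthB + bx * BLOCK_SIZE + tx;                      // moves down B
int indexC = (by * BLOCK_SIZE + ty) * widthB + bx * BLOCK_SIZE + tx;  // fixed output element

with the launch configuration

dim3 block(BLOCK_SIZE, BLOCK_SIZE);               // 16x16 = 256 threads per block
dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);
matmulTiled<<<grid, block>>>(dA, dB, dC, N, N);   // kernel name is hypothetical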

3.4 Unrolling the inner loop

Loop bookkeeping (the counter update and the branch) is comparatively expensive for a GPU thread, and the inner for loop of Program 5 has a fixed trip count of BLOCK_SIZE = 16. It can therefore be unrolled completely, replacing the loop with the single statement

Ctemp += As[ty][0]*Bs[0][tx] + As[ty][1]*Bs[1][tx] + ... + As[ty][15]*Bs[15][tx];

which removes the loop overhead entirely and improves performance by about 60%.
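The same effect can be had without writing the sum out by hand: because the trip count is a compile-time constant, nvcc's #pragma unroll directive (a standard CUDA feature, not used in the paper's listing) expands the loop automatically:

#pragma unroll
for (i = 0; i < BLOCK_SIZE; i++)    // fully unrolled by the compiler: no counter, no branch
    Ctemp += As[ty][i] * Bs[i][tx];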

3.5 CUBLAS

CUBLAS is NVIDIA's implementation of BLAS on top of CUDA. It hides the GPU entirely: the application allocates matrices in GPU memory through the library, fills them, calls the BLAS routines, and reads the results back, without writing any kernel code. Matrix multiplication is the BLAS level-3 routine cublasSgemm:

cublasSgemm('n', 'n', N, N, N, 1, A, N, B, N, 0, C, N);

Here the two 'n' arguments mean that neither input is transposed, the three N's are the dimensions m, n and k, the scalars 1 and 0 are alpha and beta in C = alpha*A*B + beta*C, and the N after each matrix is its leading dimension. The experiments use CUBLAS 2.1 [8].
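The surrounding host calls, sketched with the legacy CUBLAS API that shipped with CUDA 2.1 (error checking omitted; dA, dB, dC are device copies of A, B, C):

#include "cublas.h"

float *dA, *dB, *dC;
cublasInit();
cublasAlloc(N * N, sizeof(float), (void**)&dA);      // allocate on the GPU
cublasAlloc(N * N, sizeof(float), (void**)&dB);
cublasAlloc(N * N, sizeof(float), (void**)&dC);
cublasSetMatrix(N, N, sizeof(float), A, N, dA, N);   // copy host matrices over
cublasSetMatrix(N, N, sizeof(float), B, N, dB, N);
cublasSgemm('n', 'n', N, N, N, 1.0f, dA, N, dB, N, 0.0f, dC, N);
cublasGetMatrix(N, N, sizeof(float), dC, N, C, N);   // fetch the result
cublasFree(dA); cublasFree(dB); cublasFree(dC);
cublasShutdown();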

4 Experimental results and analysis

The test platform is as follows. CPU: Intel Core2 Duo E7400, 64 KB L1 cache, 3 MB L2 cache, 2.8 GHz, theoretical peak 44.8 GFLOPS. GPU: GeForce 9800 GTX+ with 16 streaming multiprocessors (SM) of 8 scalar processors each, 128 cores in all, executing threads in warps of 32; shader clock 1.836 GHz, memory bandwidth 70.4 GB/s, theoretical peak about 470 GFLOPS. The software environment is CUDA 2.1. All measurements use N = 1024 single-precision matrices, and Table 1 reports the achieved performance in GFLOPS.
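The quoted peaks follow from the clock rates: 2.8 GHz × 2 cores × 8 single-precision operations per cycle gives 44.8 GFLOPS for the CPU, and 1.836 GHz × 128 processors × 2 operations per multiply-add gives about 470 GFLOPS for the GPU. The GFLOPS figure is simply the 2N^3 operations of one multiplication divided by the measured time; a driver of the following shape (our sketch, with matmul() standing for any of the seven implementations) reproduces the metric:

#include <time.h>

/* matmul() stands for any of the seven implementations; for the GPU versions,
   cudaThreadSynchronize() must be called before timing stops. */
extern void matmul(const float *A, const float *B, float *C, int n);

double gflops(const float *A, const float *B, float *C, int n) {
    clock_t t0 = clock();
    matmul(A, B, C, n);
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return 2.0 * n * n * n / sec / 1e9;   /* 2*N^3 operations per multiplication */
}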

Table 1 shows how decisively data organization determines performance: the GFLOPS numbers span more than three orders of magnitude for the same arithmetic. On the CPU, blocking helps, but Intel MKL is far ahead of the hand-written versions, reaching 78% of peak and more than 200 times the speed of the naive loop. On the GPU, each successive refinement (staging a row of A, tiling both inputs, unrolling the inner loop) brings a large gain, and CUBLAS is the fastest of all: about 6 times faster than Intel MKL, the best CPU implementation. On both processors it is well-organized memory access, not raw arithmetic, that separates the fast implementations from the slow ones.

Table 1  Performance of the seven implementations (N = 1024)

Implementation                     Performance/GFLOPS    Efficiency/%
CPU, naive (Program 1)                    0.1                 0.2
CPU, blocked (Program 2)                  0.5                 1.1
CPU, Intel MKL                             35                  78
GPU, row kernel (Program 4)                21                 4.5
GPU, tiled kernel (Program 5)              58                  12
GPU, tiled + unrolled                      96                  20
GPU, CUBLAS                               209                  44

(Efficiency is relative to the 44.8 GFLOPS CPU peak for the CPU rows and the 470 GFLOPS GPU peak for the GPU rows.)
