Comparison and Analysis of Matrix Multiplications on GPU and CPU
LIU Jinfeng, GUO Lei
School of Automation, Northwestern Polytechnical University, Xi'an 710129, China
LIU Jinfeng, GUO Lei. Comparison and analysis of matrix multiplications on GPU and CPU. Computer Engineering and Applications, 2011, 47(19): 9-11.
Abstract: Three matrix multiplications on CPU and four CUDA-based matrix multiplications on GPU are described, and the causes of their high performance are analyzed. The common characteristic of the efficient algorithms is that data are properly organized and rationally reused, so the memory-access cost is effectively reduced and the speed is greatly improved. The best optimized implementation on CPU is more than 200 times faster than the naive one, and the best optimized implementation on GPU is about 6 times faster than the best one on CPU.
Key words: matrix multiplication; Compute Unified Device Architecture (CUDA); Graphic Processing Unit (GPU); storage pattern
DOI: 10.3778/j.issn.1002-8331.2011.19.003
Article ID: 1002-8331(2011)19-0009-03
CLC number: TP301
1 Introduction
GPUs have developed rapidly, and NVIDIA's CUDA[2] has made general-purpose computing on GPUs (GPGPU) practical without going through graphics APIs. This paper describes three matrix multiplications on CPU and four CUDA-based matrix multiplications on GPU, and analyzes where their performance comes from.
Matrix multiplication is used as the test problem: given two N×N matrices A and B, compute C=A*B, where each element of C is the inner product of a row of A and a column of B[1].
On the CPU side a naive implementation, a cache-blocked implementation and the Intel MKL library are compared. The common lesson, on both CPU and GPU, is that performance is dominated not by arithmetic but by how data are laid out and reused in the memory hierarchy.
2 Matrix multiplications on CPU
2.1 Naive algorithm
The naive algorithm follows the definition directly and computes each element of C with a triple loop. In line with the 80/20 rule, almost all of the running time is spent in the innermost loop, so its memory behavior determines the speed of the whole program. A is read row by row, which suits the CPU cache, but B is read column by column with stride N, so almost every access to B misses in the cache[3]. The same access pattern must also be avoided on GPU.
Biographies: LIU Jinfeng (1971–), research interests include GPU computing; GUO Lei (1956–). E-mail: ljf@sjtu.org. Received: 2011-02-28; Revised: 2011-04-26.
Each cache line loaded for B supplies only one useful element before it is evicted, so the naive algorithm wastes almost all of the bandwidth it consumes.
2.2 Memory-access analysis
The algorithm performs O(N³) operations on A and B. With column-wise access to B, consecutive accesses are N floats apart, so besides the cache misses the pages of B also thrash the TLB; once N is large, nearly every access to B causes a TLB miss. Efficient implementations such as the blocked algorithm and MKL arrange the computation so that data already resident in the cache and TLB are reused as much as possible.
The naive implementation is:
for(i=0;i<N;i++){
    for(j=0;j<N;j++){
        float Ctemp=0;
        for(k=0;k<N;k++)
            Ctemp+=A[i*N+k]*B[k*N+j];
        C[i*N+j]=Ctemp;
    }
}
2.3 Blocked algorithm
A 1 024×1 024 single-precision matrix occupies 1 024×1 024×4 bytes=4 MB, far more than the cache can hold, so data loaded into the cache are evicted before they can be reused. The blocked algorithm[4] partitions the computation into sub-blocks, e.g. 64×64, small enough to stay resident in the cache, and reuses each block many times:
int i,j,k;
for(int jj=0;jj<N;jj=jj+BLOCK){ // BLOCK is the block size
    for(int kk=0;kk<N;kk=kk+BLOCK){
        for(i=0;i<N;i++){
            for(j=jj;j<min(jj+BLOCK,N);j++){
                float Ctemp=0;
                for(k=kk;k<min(kk+BLOCK,N);k++)
                    Ctemp+=A[i*N+k]*B[k*N+j];
                C[i*N+j]+=Ctemp;
            }
        }
    }
}
2.4 Intel MKL
Intel MKL (Math Kernel Library) is a highly optimized mathematical library for Intel CPUs. It contains FFT routines and the BLAS (levels 1, 2 and 3), all carefully tuned to the cache hierarchy. Single-precision matrix multiplication is a BLAS3 routine:
cblas_sgemm(CblasRowMajor,CblasNoTrans,CblasNoTrans,N,N,N,1,A,N,B,N,0,C,N);
On the test machine MKL is about 60 times faster than the blocked implementation and more than 200 times faster than the naive one[3,5].
3 CUDA-based matrix multiplications on GPU
GPUs offer much higher peak arithmetic throughput and memory bandwidth than CPUs, and with CUDA they can be programmed in an extension of C[6-7]. Whether this potential is realized again depends on how data are stored and accessed.
3.1 CUDA programming model
In CUDA the CPU acts as the host and the GPU as the device. The host copies data to device memory, launches a kernel, and copies the results back. A kernel is executed by a grid of thread blocks; the threads of a block run on one streaming multiprocessor (SM) and can cooperate through on-chip shared memory and barrier synchronization. The device offers several kinds of storage: registers and local memory private to each thread, shared memory private to each block, and global memory, constant memory and texture memory visible to all threads. Choosing the right storage pattern for each array is the core of CUDA optimization.
3.2 One row of C per thread block
The first kernel (Kernel 4) assigns one row of C to each thread block. The block first loads the corresponding row of A into shared memory, then its threads compute the elements of that row of C in parallel. A is read from global memory only N² times while B is read N³ times, about N³+N² global accesses in total:
extern __shared__ float data[];
const int row=blockIdx.x;
for(i=tid;i<N;i+=blockDim.x)
    data[i]=A[row*lda+i]; // load one row of A into shared memory
__syncthreads();
for(j=tid;j<N;j+=blockDim.x){
    float Ctemp=0;
    for(i=0;i<N;i++)
        Ctemp+=data[i]*B[i*ldb+j];
    C[row*ldc+j]=Ctemp;
}
This kernel reaches only about 1/20 of the GPU's peak performance: the row of A is reused from shared memory, but every element of B is still fetched from global memory N times, so the kernel is bound by global-memory bandwidth.
3.3 Tiled algorithm using shared memory
The next kernel (Kernel 5) tiles the computation. C is partitioned into K×K tiles and each thread block computes one tile. The block loops N/K times; in each iteration it loads one K×K tile of A and one of B into shared memory, and every loaded element is then reused K times, so global memory reads drop from 2N³ to 2N³/K. Here K=BLOCK_SIZE=16, each block contains 16×16 threads, and the loop runs N/16 times:
float Ctemp=0;
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
for(...){
    // load one tile of A and one tile of B into shared memory
    As[ty][tx]=A[indexA];
    Bs[ty][tx]=B[indexB];
    indexA+=BLOCK_SIZE;
    indexB+=widthB*BLOCK_SIZE;
    __syncthreads();
    // multiply the two tiles
    for(i=0;i<BLOCK_SIZE;i++)
        Ctemp+=As[ty][i]*Bs[i][tx];
    // wait before the tiles are overwritten in the next iteration
    __syncthreads();
}
C[indexC]=Ctemp;
3.4 Loop unrolling
The inner for loop of Kernel 5 has a fixed trip count of 16, so it can be unrolled completely, which removes the loop-control instructions:
Ctemp+=As[ty][0]*Bs[0][tx]+As[ty][1]*Bs[1][tx]+…+As[ty][15]*Bs[15][tx];
Unrolling improves the performance of Kernel 5 by about 60%.
3.5 CUBLAS
CUBLAS[8] is NVIDIA's BLAS library for the GPU. Its single-precision matrix multiplication is called as:
cublasSgemm('n','n',N,N,N,1,A,N,B,N,0,C,N);
The two 'n' arguments mean that neither A nor B is transposed; alpha=1 and beta=0, so the call computes C=A*B. Version 2.1 of CUBLAS was used.
4 Experimental results
The experiments use CUDA 2.1. The test GPU has 8 scalar cores per SM, schedules threads in warps of 32, runs a 1.836 GHz shader clock, and provides 70.4 GB/s of memory bandwidth and a peak of about 470 GFLOPS.
All implementations were measured with N=1 024 single-precision matrices; Table 1 lists the achieved GFLOPS and the fraction of peak for each. On the CPU, Intel MKL is by far the fastest and reaches a large fraction of peak. On the GPU, performance rises steadily from the row kernel to the tiled kernel to the unrolled kernel, and CUBLAS is the fastest of all, about 6 times faster than Intel MKL on the CPU. In every case the efficient implementation is the one whose storage pattern lets loaded data be reused.
Table 1 Performance of the seven implementations, N=1 024
Implementation        Performance/GFLOPS   Fraction of peak/%
CPU naive             0.1                  0.2
CPU blocked           0.5                  1.1
Intel MKL             35                   78
GPU row kernel        21                   4.5
GPU tiled kernel      58                   12
GPU unrolled kernel   96                   20
CUBLAS                209                  44