Comparison and Analysis of Matrix Multiplications on GPU and CPU
LIU Jinfeng, GUO Lei
School of Automation, Northwestern Polytechnical University, Xi'an 710129, China
LIU Jinfeng, GUO Lei. Comparison and analysis of matrix multiplications on GPU and CPU. Computer Engineering and Applications, 2011, 47(19): 9-11.
Abstract: Three matrix multiplications on CPU and four CUDA-based matrix multiplications on GPU are described, and the causes of their high performance are analyzed. The common characteristic of the efficient algorithms is that data are properly organized and rationally reused, so the memory-access cost is effectively reduced and the speed is greatly improved. The best optimized implementation on CPU is more than 200 times faster than the naive one, and the best optimized implementation on GPU is about 6 times faster than the best one on CPU.
Key words: matrix multiplication; Compute Unified Device Architecture (CUDA); Graphic Processing Unit (GPU); storage pattern
DOI: 10.3778/j.issn.1002-8331.2011.19.003
Article ID: 1002-8331(2011)19-0009-03
CLC number: TP301
1 Introduction
GPUs have developed rapidly, and NVIDIA's CUDA[2] has made general-purpose computing on GPUs (GPGPU) practical without going through graphics APIs. This paper describes three matrix multiplications on CPU and four CUDA-based matrix multiplications on GPU, and analyzes where their performance comes from.
Matrix multiplication is used as the test problem: given two N×N matrices A and B, compute C=A*B, where each element of C is the inner product of a row of A and a column of B[1].
On the CPU side a naive implementation, a cache-blocked implementation and the Intel MKL library are compared. The common lesson, on both CPU and GPU, is that performance is dominated not by arithmetic but by how data are laid out and reused in the memory hierarchy.
2 Matrix multiplications on CPU
2.1 Naive algorithm
The naive algorithm follows the definition directly and computes each element of C with a triple loop. In line with the 80/20 rule, almost all of the running time is spent in the innermost loop, so its memory behavior determines the speed of the whole program. A is read row by row, which suits the CPU cache, but B is read column by column with stride N, so almost every access to B misses in the cache[3]. The same access pattern must also be avoided on GPU.
Biographies: LIU Jinfeng (1971–), research interests include GPU computing; GUO Lei (1956–). E-mail: ljf@sjtu.org. Received: 2011-02-28; Revised: 2011-04-26.
Each cache line loaded for B supplies only one useful element before it is evicted, so the naive algorithm wastes almost all of the bandwidth it consumes.
2.2 Memory-access analysis
The algorithm performs O(N³) operations on A and B. With column-wise access to B, consecutive accesses are N floats apart, so besides the cache misses the pages of B also thrash the TLB; once N is large, nearly every access to B causes a TLB miss. Efficient implementations such as the blocked algorithm and MKL arrange the computation so that data already resident in the cache and TLB are reused as much as possible.
The naive implementation is:
for(i=0;i<N;i++){
    for(j=0;j<N;j++){
        float Ctemp=0;
        for(k=0;k<N;k++)
            Ctemp+=A[i*N+k]*B[k*N+j];
        C[i*N+j]=Ctemp;
    }
}
2.3 Blocked algorithm
A 1 024×1 024 single-precision matrix occupies 1 024×1 024×4 bytes=4 MB, far more than the cache can hold, so data loaded into the cache are evicted before they can be reused. The blocked algorithm[4] partitions the computation into sub-blocks, e.g. 64×64, small enough to stay resident in the cache, and reuses each block many times:
int i,j,k;
for(int jj=0;jj<N;jj=jj+BLOCK){ // BLOCK is the block size
    for(int kk=0;kk<N;kk=kk+BLOCK){
        for(i=0;i<N;i++){
            for(j=jj;j<min(jj+BLOCK,N);j++){
                float Ctemp=0;
                for(k=kk;k<min(kk+BLOCK,N);k++)
                    Ctemp+=A[i*N+k]*B[k*N+j];
                C[i*N+j]+=Ctemp;
            }
        }
    }
}
2.4 Intel MKL
Intel MKL (Math Kernel Library) is a highly optimized mathematical library for Intel CPUs. It contains FFT routines and the BLAS (levels 1, 2 and 3), all carefully tuned to the cache hierarchy. Single-precision matrix multiplication is a BLAS3 routine:
cblas_sgemm(CblasRowMajor,CblasNoTrans,CblasNoTrans,N,N,N,1,A,N,B,N,0,C,N);
On the test machine MKL is about 60 times faster than the blocked implementation and more than 200 times faster than the naive one[3,5].
3 CUDA-based matrix multiplications on GPU
GPUs offer much higher peak arithmetic throughput and memory bandwidth than CPUs, and with CUDA they can be programmed in an extension of C[6-7]. Whether this potential is realized again depends on how data are stored and accessed.
3.1 CUDA programming model
In CUDA the CPU acts as the host and the GPU as the device. The host copies data to device memory, launches a kernel, and copies the results back. A kernel is executed by a grid of thread blocks; the threads of a block run on one streaming multiprocessor (SM) and can cooperate through on-chip shared memory and barrier synchronization. The device offers several kinds of storage: registers and local memory private to each thread, shared memory private to each block, and global memory, constant memory and texture memory visible to all threads. Choosing the right storage pattern for each array is the core of CUDA optimization.
3.2 One row of C per thread block
The first kernel (Kernel 4) assigns one row of C to each thread block. The block first loads the corresponding row of A into shared memory, then its threads compute the elements of that row of C in parallel. A is read from global memory only N² times while B is read N³ times, about N³+N² global accesses in total:
extern __shared__ float data[];
const int row=blockIdx.x;
for(i=tid;i<N;i+=blockDim.x)
    data[i]=A[row*lda+i]; // load one row of A into shared memory
__syncthreads();
for(j=tid;j<N;j+=blockDim.x){
    float Ctemp=0;
    for(i=0;i<N;i++)
        Ctemp+=data[i]*B[i*ldb+j];
    C[row*ldc+j]=Ctemp;
}
This kernel reaches only about 1/20 of the GPU's peak performance: the row of A is reused from shared memory, but every element of B is still fetched from global memory N times, so the kernel is bound by global-memory bandwidth.
3.3 Tiled algorithm using shared memory
The next kernel (Kernel 5) tiles the computation. C is partitioned into K×K tiles and each thread block computes one tile. The block loops N/K times; in each iteration it loads one K×K tile of A and one of B into shared memory, and every loaded element is then reused K times, so global memory reads drop from 2N³ to 2N³/K. Here K=BLOCK_SIZE=16, each block contains 16×16 threads, and the loop runs N/16 times:
float Ctemp=0;
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
for(...){
    // load one tile of A and one tile of B into shared memory
    As[ty][tx]=A[indexA];
    Bs[ty][tx]=B[indexB];
    indexA+=BLOCK_SIZE;
    indexB+=widthB*BLOCK_SIZE;
    __syncthreads();
    // multiply the two tiles
    for(i=0;i<BLOCK_SIZE;i++)
        Ctemp+=As[ty][i]*Bs[i][tx];
    // wait before the tiles are overwritten in the next iteration
    __syncthreads();
}
C[indexC]=Ctemp;
3.4 Loop unrolling
The inner for loop of Kernel 5 has a fixed trip count of 16, so it can be unrolled completely, which removes the loop-control instructions:
Ctemp+=As[ty][0]*Bs[0][tx]+As[ty][1]*Bs[1][tx]+…+As[ty][15]*Bs[15][tx];
Unrolling improves the performance of Kernel 5 by about 60%.
3.5 CUBLAS
CUBLAS[8] is NVIDIA's BLAS library for the GPU. Its single-precision matrix multiplication is called as:
cublasSgemm('n','n',N,N,N,1,A,N,B,N,0,C,N);
The two 'n' arguments mean that neither A nor B is transposed; alpha=1 and beta=0, so the call computes C=A*B. Version 2.1 of CUBLAS was used.
4 Experimental results
The experiments use CUDA 2.1. The test GPU has 8 scalar cores per SM, schedules threads in warps of 32, runs a 1.836 GHz shader clock, and provides 70.4 GB/s of memory bandwidth and a peak of about 470 GFLOPS.
All implementations were measured with N=1 024 single-precision matrices; Table 1 lists the achieved GFLOPS and the fraction of peak for each. On the CPU, Intel MKL is by far the fastest and reaches a large fraction of peak. On the GPU, performance rises steadily from the row kernel to the tiled kernel to the unrolled kernel, and CUBLAS is the fastest of all, about 6 times faster than Intel MKL on the CPU. In every case the efficient implementation is the one whose storage pattern lets loaded data be reused.
Table 1 Performance of the seven implementations, N=1 024
Implementation        Performance/GFLOPS   Fraction of peak/%
CPU naive             0.1                  0.2
CPU blocked           0.5                  1.1
Intel MKL             35                   78
GPU row kernel        21                   4.5
GPU tiled kernel      58                   12
GPU unrolled kernel   96                   20
CUBLAS                209                  44