OpenCL GPU Matrix multiplication program
__kernel
void matrixMultiplication(__global float* A, __global float* B,
                          __global float* C, int widthA, int widthB)
{
    int i = get_global_id(0);   // column index of C
    int j = get_global_id(1);   // row index of C
    float value = 0;
    for (int k = 0; k < widthA; k++)
    {
        value = value + A[k + j * widthA] * B[k * widthB + i];
    }
    C[i + widthB * j] = value;  // note: the row stride of C is widthB, not widthA
}
1.
Ivan
Could you please upload code for matrix inversion, or at least give a hint?
2.
KOnark PAtel
Thanks for this code. I have a query: a few days back this code worked well on my Windows 8 machine with Visual Studio, but since today the program runs too slowly when executing these lines:
float p = (rand() % 100) / 7.0;
*(A + i * heightA + j) = rand() % 100 + p;
fprintf(fp1, "%f ", *(A + i * heightA + j));
and the program does not finish; the command console just remains open. Can you suggest any solution?
3.
paweln66
Useful solution. In lines 45 and 62 you have a bug: elements of the for loops are missing.
Result:
sh: pause: command not found
4.
Alex
Cool code, thanks. However, I tried to implement it in VS 2012 and got two errors.
1. First:
for (int i = 0; i < widthA; i++)
{
    for (int j = 0; j < heightA; j++) { // I think you forgot the heightA here
        float p = (rand() % 100) / 7.0;
        *(A + i * heightA + j) = rand() % 100 + p;
        fprintf(fp1, "%f ", *(A + i * heightA + j));
    }
    fprintf(fp1, "\n");
}
2. Second:
for (int i = 0; i < widthB; i++)
{
    for (int j = 0; j < heightB; j++) { // Here the same thing
        float p = (rand() % 100) / 7.0;
        *(B + i * heightB + j) = rand() % 100 + p;
        fprintf(fp1, "%f ", *(B + i * heightB + j));
    }
    fprintf(fp1, "\n");
}
5KK73 GPU assignment website 2014/2015

Matrix multiplication in OpenCL
This document describes a matrix multiplication example application using OpenCL for Nvidia GPUs; the focus is on the code structure of the host application and the OpenCL GPU kernels. For examples of optimized matrix multiplication, please refer to the CUDA example documentation; most CUDA kernels will be very similar in an OpenCL implementation. This example can be found here. The source code for the OpenCL matrix multiplication example can be found here.
Host code

The host code initializes the OpenCL-capable GPUs, allocates and transfers memory, and executes the OpenCL kernel.

The code shown below declares OpenCL memory objects which will be instantiated on the device, hence the prefix 'd_'. The A and B memories are the two input matrices of size 1024x1024; C is the result matrix. Since the memory described above resides on the device, we also need to declare and allocate memory on the host, in this case the server, and fill the input arrays with values. This is done by the randomInit() function.
// OpenCL device memory for matrices
cl_mem d_A;
cl_mem d_B;
cl_mem d_C;

// set seed for rand()
srand(2014);

// Allocate host memory for matrices A and B
unsigned int size_A = WA * HA;
unsigned int mem_size_A = sizeof(float) * size_A;
float* h_A = (float*) malloc(mem_size_A);

cl_uint dev_cnt = 0;
clGetPlatformIDs(0, 0, &dev_cnt);
cl_platform_id platform_ids[100];
clGetPlatformIDs(dev_cnt, platform_ids, NULL);
Once an OpenCL context and command queue are defined, the OpenCL kernel can be loaded. In OpenCL, kernels are typically loaded at runtime and compiled by the function clBuildProgram. To do this, the kernel source is loaded by the function LoadOpenCLKernel and transformed into an OpenCL program description with the clCreateProgramWithSource function. The built kernel description is then made ready for execution by the clCreateKernel function. Be aware that the second argument should match the name of the kernel as described in the .cl file.
if (err != CL_SUCCESS)
{
    // on a build failure, print the build log (fetched into 'buffer'
    // by clGetProgramBuildInfo in the surrounding code, not shown here)
    printf("%s\n", buffer);
    exit(1);
}
Now that the kernel is ready for execution, the buffers on the compute device (in our case the GPU) must be allocated; this is done with the clCreateBuffer function, whose arguments can be used to specify whether a memory object is read-only, write-only, or read-write. Specifying this correctly can help increase performance. The function clSetKernelArg links the allocated memory on the GPU to the arguments of the kernel, in our case the A, B, and C matrices and two integers specifying the widths of the matrices.
int wA = WA;
int wC = WC;
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&d_C);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&d_A);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&d_B);
err |= clSetKernelArg(kernel, 3, sizeof(int), (void *)&wA);
err |= clSetKernelArg(kernel, 4, sizeof(int), (void *)&wC);
if (err != CL_SUCCESS)
{
    printf("Error: Failed to set kernel arguments! %d\n", err);
    exit(1);
}
localWorkSize[0] = 16;
localWorkSize[1] = 16;
globalWorkSize[0] = 1024;
globalWorkSize[1] = 1024;
if (err != CL_SUCCESS)
{
printf("Error: Failed to execute kernel! %d\n", err);
exit(1);
}
if (err != CL_SUCCESS)
{
printf("Error: Failed to read output array! %d\n", err);
exit(1);
}
GPU code
The OpenCL kernel is very similar in structure to a CUDA kernel, with some small differences. Global (off-chip) memory is declared with __global, and on-chip local memory with __local; the latter is what CUDA calls shared memory. Additionally, a structure similar to CUDA is used for determining the thread id. This is done via the get_global_id function, which works for multiple dimensions. The return values of this function can be used to determine the matrix location to read for the calculation. Due to the similar structure of CUDA and OpenCL, many of the optimizations described in the CUDA matrix multiplication example can be applied to the OpenCL version without too many modifications.
/* kernel.cl
* Matrix multiplication: C = A * B.
* Device code.
*/
// OpenCL Kernel
__kernel void
matrixMul(__global float* C,
__global float* A,
__global float* B,
int wA, int wB)
{
int tx = get_global_id(0);
int ty = get_global_id(1);