Thursday, August 10, 2017

Implementing nested loop in OpenCL?

I'm new to OpenCL and have been trying to implement a three-level nested loop in a kernel function. I suspect my understanding is incomplete. Below is the C code for the logic:

    void scale(float *output, float *scales, int batch, int n, int size)
    {
        int i, j, b;
        for (b = 0; b < batch; ++b) {
            for (i = 0; i < n; ++i) {
                for (j = 0; j < size; ++j) {
                    output[(b*n + i)*size + j] *= scales[i];
                }
            }
        }
    }

Here output and scales are 1D arrays. For example:

    float output[18] = {1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9};
    float scales[9] = {1,0,1,0,1,0,1,0,1};

    int n = 9;
    int size = 2;
    int batch = 1;

The expected output is:

1.000000  2.000000  0.000000  0.000000  5.000000  6.000000   0.000000  0.000000  9.000000  1.000000  0.000000  0.000000  4.000000  5.000000  0.000000  0.000000  8.000000  9.000000 
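As a sanity check on the expected values, here is a minimal sketch that runs the reference loop on the example data and compares the result element by element. The `scale` function is copied from the question; `check_scale_example` is a hypothetical helper name introduced only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Reference implementation, copied from the question. */
void scale(float *output, float *scales, int batch, int n, int size)
{
    int i, j, b;
    for (b = 0; b < batch; ++b) {
        for (i = 0; i < n; ++i) {
            for (j = 0; j < size; ++j) {
                output[(b * n + i) * size + j] *= scales[i];
            }
        }
    }
}

/* Hypothetical helper: runs the example from the question, prints each
 * element, and asserts that it matches the expected list.
 * Returns the number of elements checked. */
int check_scale_example(void)
{
    float output[18] = {1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9};
    float scales[9] = {1,0,1,0,1,0,1,0,1};
    const float expected[18] = {1,2,0,0,5,6,0,0,9,1,0,0,4,5,0,0,8,9};
    int k;

    scale(output, scales, 1, 9, 2);   /* batch = 1, n = 9, size = 2 */
    for (k = 0; k < 18; ++k) {
        printf("%f  ", output[k]);
        assert(output[k] == expected[k]);
    }
    printf("\n");
    return 18;
}
```

Each scales[i] multiplies `size` consecutive elements, so with size = 2 every zero in scales clears a pair of neighbouring outputs.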

Below is my OpenCL kernel

    __kernel void scale_kernel(__global float *output, __global float *biases, int n, int size)
    {
        int j = get_global_id(0);
        int i = get_group_id(1);
        int b = get_group_id(2);

        if (j < size) output[(b*n + i)*size + j] *= biases[i];
    }

I believe this kernel is correct and that the problem lies in how I launch the NDRange kernel. My BLOCK size is 16 (I think this is where my understanding is wrong).

    size_t global_work_size[3] = {(size-1)/BLOCK + 1, n, batch};
    size_t local_work_size[3] = {BLOCK, 1, 1};
    cl.error = clEnqueueNDRangeKernel(queue, kernel, 3, 0, global_work_size, local_work_size, 0, 0, NULL);
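One likely cause is the meaning of global_work_size: each entry counts work-items, not work-groups, and in OpenCL 1.x it must be a multiple of the matching local_work_size entry. The expression (size-1)/BLOCK + 1 is the number of groups (1 when size = 2), which is not a multiple of 16. A minimal sketch of the usual rounding, where `round_up_to_multiple` and `make_launch_sizes` are hypothetical helper names and not part of the OpenCL API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of `local` that is >= `count`. */
size_t round_up_to_multiple(size_t count, size_t local)
{
    assert(local > 0);
    return ((count + local - 1) / local) * local;
}

/* Hypothetical helper: fills the 3D sizes for a launch in which
 * dimension 0 is padded up to a whole number of blocks and
 * dimensions 1 and 2 use one work-item per index. */
void make_launch_sizes(size_t size, size_t n, size_t batch, size_t block,
                       size_t global[3], size_t local[3])
{
    local[0] = block;
    local[1] = 1;
    local[2] = 1;

    global[0] = round_up_to_multiple(size, block); /* work-items, not groups */
    global[1] = n;
    global[2] = batch;
}
```

With size = 2, n = 9, batch = 1 and BLOCK = 16 this gives a global size of {16, 9, 1}; the `if (j < size)` guard in the kernel then discards the 14 padding work-items in each row.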

EDIT 1:

Changing global_work_size as shown below produces the expected output; in this case I set local_work_size to NULL. This might not give the best performance.

    size_t global_work_size[3] = {size, n, batch};
    cl.error = clEnqueueNDRangeKernel(queue, kernel, 3, 0, global_work_size, NULL, 0, 0, NULL);

Please let me know how to choose global_work_size and local_work_size.
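To reason about the index mapping without a device, the launch can be emulated on the CPU. The sketch below is plain C, not OpenCL: `scale_work_item` and `emulate_ndrange` are hypothetical names, and the three loop counters stand in for get_global_id(0), get_global_id(1) and get_global_id(2). It assumes global IDs are used in every dimension; the posted kernel's get_group_id(1) and get_group_id(2) return the same values only when the local size in those dimensions is 1, which is not guaranteed when local_work_size is NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one work-item. gj, gi and gb play the role of the global IDs
 * in dimensions 0, 1 and 2 of a real kernel. */
static void scale_work_item(float *output, const float *scales,
                            int n, int size,
                            size_t gj, size_t gi, size_t gb)
{
    /* Guard against padding work-items in dimension 0. */
    if (gj < (size_t)size) {
        output[(gb * (size_t)n + gi) * (size_t)size + gj] *= scales[gi];
    }
}

/* Hypothetical CPU stand-in for a 3D NDRange launch: visits every
 * global ID exactly once. global[0] may be padded beyond `size`. */
void emulate_ndrange(float *output, const float *scales,
                     int n, int size, const size_t global[3])
{
    size_t gj, gi, gb;

    /* Dimension 1 has no guard in the work-item, so it must equal n. */
    assert(global[1] == (size_t)n);

    for (gb = 0; gb < global[2]; ++gb) {
        for (gi = 0; gi < global[1]; ++gi) {
            for (gj = 0; gj < global[0]; ++gj) {
                scale_work_item(output, scales, n, size, gj, gi, gb);
            }
        }
    }
}
```

Running this with the example data and a global size of {16, 9, 1} reproduces the reference loop's result, since the extra work-items in dimension 0 fail the guard and do nothing.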

0 Answers
