How does a barrier work in an OpenCL kernel?


Kernel code:

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    #pragma OPENCL EXTENSION cl_amd_printf : enable

    __kernel void calculate(__global double* in)
    {
        int idx = get_global_id(0);                                  // statement 1
        printf("started %d workitem\n", idx);                        // statement 2
        in[idx] = idx + 100;                                         // statement 3
        printf("value changed %lf in %d workitem\n", in[idx], idx);  // statement 4
        barrier(CLK_GLOBAL_MEM_FENCE);                               // statement 5
        printf("completed %d workitem\n", idx);                      // statement 6
    }

I call the kernel with clEnqueueNDRangeKernel, passing as its argument an array of 5 doubles, each initialized to 0.0.

I call the kernel with a global_work_size of 5, so each element of the array is handled by its own work-item.

As per my theoretical understanding of barriers: to synchronize work-items in a work-group, OpenCL provides the barrier function. It forces a work-item to wait until every other work-item in the group reaches the barrier. By creating a barrier, you can make sure that every work-item has reached the same point in its processing. This is a crucial concern when work-items need to finish computing an intermediate result that will be used in a future computation.

Hence, I was expecting output like:

    started 0 workitem
    started 1 workitem
    value changed 100.000000 in 0 workitem
    value changed 101.000000 in 1 workitem
    started 3 workitem
    value changed 103.000000 in 3 workitem
    started 2 workitem
    value changed 102.000000 in 2 workitem
    started 4 workitem
    value changed 104.000000 in 4 workitem

    completed 3 workitem
    completed 0 workitem
    completed 1 workitem
    completed 2 workitem
    completed 4 workitem

These "completed" statements should all come at the end, because the barrier holds each work-item back until all the others have reached that point.

But the result I am getting is:

    started 0 workitem
    value changed 100.000000 in 0 workitem
    completed 0 workitem
    started 4 workitem
    value changed 104.000000 in 4 workitem
    completed 4 workitem
    started 1 workitem
    started 2 workitem
    started 3 workitem
    value changed 101.000000 in 1 workitem
    value changed 103.000000 in 3 workitem
    completed 3 workitem
    value changed 102.000000 in 2 workitem
    completed 2 workitem
    completed 1 workitem

Am I missing something in my logic? If so, how does a barrier work in an OpenCL kernel?

I added more checks to the kernel, cross-checking the updated values after the barrier instead of relying on print statements.

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    #pragma OPENCL EXTENSION cl_amd_printf : enable

    __kernel void calculate(__global double* in)
    {
        int idx = get_global_id(0);
        in[idx] = idx + 100;
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (idx == 0) {
            in[0] = in[4];
            in[1] = in[3];
            in[2] = in[2];
            in[3] = in[1];
            in[4] = in[0];
        }
    }

After the kernel runs, the array should then be:

    after arr[0] = 104.000000
    after arr[1] = 103.000000
    after arr[2] = 102.000000
    after arr[3] = 101.000000
    after arr[4] = 100.000000

But the results I am getting are:

    after arr[0] = 0.000000
    after arr[1] = 101.000000
    after arr[2] = 102.000000
    after arr[3] = 103.000000
    after arr[4] = 104.000000

The code looks fine. I suspect the local work-group size: if you have not specified a local work-group size, the OpenCL implementation chooses one for you based on its own checks (and it may well be one).

Check your clEnqueueNDRangeKernel call against the call below:

    size_t global_item_size = 5; // total number of work-items
    size_t local_item_size = 5;  // number of work-items per work-group
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &global_item_size,
                           &local_item_size, 0, NULL, NULL);
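To see how the two sizes in that call relate, here is a minimal sketch in plain C (not OpenCL host code) of how a 1-D range with zero global offset is split into work-groups. The helper names num_groups, group_of and local_id_of are hypothetical; they only mirror what get_num_groups(0), get_group_id(0) and get_local_id(0) would report inside a kernel, assuming the global size is a multiple of the local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of work-groups in a 1-D range.
   Assumes local_size divides global_size evenly. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Hypothetical helper: which work-group a global id falls into
   (zero global offset assumed). */
size_t group_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id / local_size;
}

/* Hypothetical helper: position of a global id inside its work-group. */
size_t local_id_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    return global_id % local_size;
}
```

With a global size of 5, a local size of 5 puts all five work-items into group 0, while a local size of 1 gives five groups of one work-item each.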

Note: this answer assumes that you have either not specified the local work-group size or not set it as required.

A little more on work-groups:

A barrier only blocks the work-items within a single work-group. Since you have not specified a work-group size (so it is treated as one), you have 5 work-groups, each containing 1 work-item, and the barrier does not synchronize across them.
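The effect of the work-group size on the second kernel from the question can be illustrated with a small sequential model in plain C. This is only a sketch under loud assumptions: the function simulate_kernel is hypothetical, work-groups are run one after another in index order (a real device may schedule them in any order or concurrently), and the barrier is modelled by finishing the pre-barrier phase for a whole group before starting its post-barrier phase.

```c
#include <assert.h>
#include <stddef.h>

/* Phase before the barrier: every work-item writes its own slot. */
static void before_barrier(double *in, size_t idx)
{
    in[idx] = (double)idx + 100.0;
}

/* Phase after the barrier: only work-item 0 copies values around.
   Assumes the array has at least 5 elements, as in the question. */
static void after_barrier(double *in, size_t idx)
{
    if (idx == 0) {
        in[0] = in[4];
        in[1] = in[3];
        in[2] = in[2];
        in[3] = in[1];
        in[4] = in[0];
    }
}

/* Hypothetical model: groups run sequentially in index order (an assumption);
   inside a group, all work-items finish the first phase before any of them
   starts the second, which is what barrier() guarantees per work-group. */
void simulate_kernel(double *in, size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    for (size_t first = 0; first < global_size; first += local_size) {
        for (size_t l = 0; l < local_size; ++l)
            before_barrier(in, first + l);
        /* barrier: only this group's work-items are synchronized here */
        for (size_t l = 0; l < local_size; ++l)
            after_barrier(in, first + l);
    }
}
```

Under this model, a zero-initialized array with a local size of 1 ends up as 0, 101, 102, 103, 104, which matches the result reported in the question: work-item 0 passes its barrier and copies in[4] before the other groups have written anything. With a local size of 5 the same model yields 104, 103, 102, 103, 104, because the in-place copies into in[3] and in[4] read slots that work-item 0 has already overwritten.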

