How does a barrier work in an OpenCL kernel?
Kernel code:

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    #pragma OPENCL EXTENSION cl_amd_printf : enable

    __kernel void calculate(__global double* in)
    {
        int idx = get_global_id(0);                                  // statement 1
        printf("started %d workitem\n", idx);                        // statement 2
        in[idx] = idx + 100;                                         // statement 3
        printf("value changed %lf in %d workitem\n", in[idx], idx);  // statement 4
        barrier(CLK_GLOBAL_MEM_FENCE);                               // statement 5
        printf("completed %d workitem\n", idx);                      // statement 6
    }
I am calling the kernel with clEnqueueNDRangeKernel, passing as its argument an array of 5 doubles, each initialized to 0.0.
I call the kernel with a global_work_size of 5, so each work-item handles one element of the array.
But as per my theoretical understanding of barriers: to synchronize work-items in a work-group, OpenCL provides the barrier function. It forces a work-item to wait until every other work-item in the group reaches the barrier. By creating a barrier, you can make sure that every work-item has reached the same point in its processing. This is a crucial concern when the work-items need to finish computing an intermediate result that will be used in a future computation.
Hence, I was expecting output like:

    started 0 workitem
    started 1 workitem
    value changed 100.000000 in 0 workitem
    value changed 101.000000 in 1 workitem
    started 3 workitem
    value changed 103.000000 in 3 workitem
    started 2 workitem
    value changed 102.000000 in 2 workitem
    started 4 workitem
    value changed 104.000000 in 4 workitem
    completed 3 workitem
    completed 0 workitem
    completed 1 workitem
    completed 2 workitem
    completed 4 workitem
These "completed" statements should all come together at the end, because the barrier holds back the other work-items until every one of them has reached that point.
But the result I am getting is:

    started 0 workitem
    value changed 100.000000 in 0 workitem
    completed 0 workitem
    started 4 workitem
    value changed 104.000000 in 4 workitem
    completed 4 workitem
    started 1 workitem
    started 2 workitem
    started 3 workitem
    value changed 101.000000 in 1 workitem
    value changed 103.000000 in 3 workitem
    completed 3 workitem
    value changed 102.000000 in 2 workitem
    completed 2 workitem
    completed 1 workitem
Am I missing something in my logic? If so, how does a barrier work in an OpenCL kernel?
I added more checks to the kernel, cross-checking the updated values after the barrier instead of relying on print statements:

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    #pragma OPENCL EXTENSION cl_amd_printf : enable

    __kernel void calculate(__global double* in)
    {
        int idx = get_global_id(0);
        in[idx] = idx + 100;
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (idx == 0) {
            in[0] = in[4];
            in[1] = in[3];
            in[2] = in[2];
            in[3] = in[1];
            in[4] = in[0];
        }
    }
Afterwards, the array should be:

    after arr[0] = 104.000000
    after arr[1] = 103.000000
    after arr[2] = 102.000000
    after arr[3] = 101.000000
    after arr[4] = 100.000000
But the results I am getting are:

    after arr[0] = 0.000000
    after arr[1] = 101.000000
    after arr[2] = 102.000000
    after arr[3] = 103.000000
    after arr[4] = 104.000000
The code looks fine. I suspect the size of the local work-group: if you have not specified a local work-group size, the OpenCL implementation chooses one itself based on some checks (and it may well be one).
Check your clEnqueueNDRangeKernel call against the call below:

    size_t global_item_size = 5; // total number of work-items
    size_t local_item_size = 5;  // number of work-items per work-group
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                           &global_item_size, &local_item_size,
                           0, NULL, NULL);
Note: this answer assumes that you either have not specified the local work-group size or have not set it as required.
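To make the effect of the local size concrete, here is a small sketch in plain C of how a 1-D range is split into work-groups. The `model_*` helpers are hypothetical names of my own, not OpenCL API; they only mirror what `get_num_groups(0)`, `get_group_id(0)` and `get_local_id(0)` would report, assuming no global offset and a local size that divides the global size evenly.

```c
#include <stddef.h>

/* Hypothetical helpers (not part of the OpenCL API). They model a 1-D
   NDRange with no global offset, assuming local_size divides global_size. */

/* Number of work-groups the range is split into. */
size_t model_num_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}

/* Work-group that a given global id lands in. */
size_t model_group_id(size_t global_id, size_t local_size)
{
    return global_id / local_size;
}

/* Position of a given global id inside its work-group. */
size_t model_local_id(size_t global_id, size_t local_size)
{
    return global_id % local_size;
}
```

With a global size of 5, a local size of 5 gives a single group holding all five work-items, while a local size of 1 puts every work-item in its own group, so a barrier has nothing else to wait for.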
A little more on work-groups:
A barrier blocks only the threads within a single work-group. Since you have not specified the work-group size (so its size is taken as one), you have 5 work-groups, each with 1 thread.
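As a rough illustration, the second kernel from the question can be emulated in plain C. This is only a sketch under loud assumptions: `emulate_kernel` and `NUM_ITEMS` are hypothetical names, work-groups are run one after another in index order (just one schedule a runtime might pick, since OpenCL does not promise any order between groups), and the barrier is modelled by finishing the pre-barrier code for the whole group before running the post-barrier code.

```c
#include <stddef.h>

#define NUM_ITEMS 5  /* number of work-items, as in the question */

/* Emulates the second kernel for NUM_ITEMS work-items split into groups of
   local_size (assumed to divide NUM_ITEMS). Groups run sequentially, which
   is an assumption of this model, not something OpenCL guarantees. */
void emulate_kernel(double in[NUM_ITEMS], size_t local_size)
{
    for (size_t first = 0; first < NUM_ITEMS; first += local_size) {
        /* Phase 1: every work-item of this group runs the code before barrier(). */
        for (size_t idx = first; idx < first + local_size; ++idx)
            in[idx] = (double)idx + 100.0;

        /* barrier(): phase 1 is complete for this group only. */

        /* Phase 2: code after barrier(); only work-item 0 does anything. */
        for (size_t idx = first; idx < first + local_size; ++idx) {
            if (idx == 0) {
                in[0] = in[4];
                in[1] = in[3];
                in[2] = in[2];
                in[3] = in[1];
                in[4] = in[0];
            }
        }
    }
}
```

Starting from an array of zeros, a local size of 1 leaves 0, 101, 102, 103, 104 in this model, which is the output reported in the question: work-item 0 passes its barrier alone and copies elements that the other groups have not written yet. With a local size of 5 the model gives 104, 103, 102, 103, 104 rather than a full reversal, because `in[3] = in[1]` and `in[4] = in[0]` read elements that were already overwritten; a real swap would need temporaries.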