I'm not sure if this is the right place to ask, but since it is so elementary,
I would appreciate some explanation.
In every CUDA tutorial example (e.g., doubling each element of an
array), the kernel function contains lines like the following:
int idx = blockIdx.x * blockDim.x + threadIdx.x; // a unique index for each thread
if (idx < N)   // N is the number of elements in the array
    a[idx] *= 2;
An "if" branch is a rather expensive operation, so why do we want every
thread to perform this check?
Since only one kernel is allowed to execute on a device at a time, why
don't we let each thread double its own associated element, and
afterwards simply copy the N elements back to the host?
Basically, we would just omit the "if" check and go straight to the
doubling. That approach seems more straightforward.
Am I missing anything?
Department of Electrical and Computer Engineering
Montreal, QC, Canada