1. What happens without the branch if N is not a multiple of blockDim?
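Without the branch, the surplus threads in the last block read and write
past the end of the array. That is undefined behavior: it can silently
corrupt neighboring memory or make the kernel launch fail.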
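2. Branches are not as expensive as you think. A branch only costs extra
when threads within the same warp take different paths, and here that can
happen only in the one warp that straddles N. Global memory reads and
writes are the most expensive part of a kernel like this.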
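To put numbers on both points, here is a rough pure-Python sketch of the launch arithmetic (no GPU needed). The values n = 1000 and block_dim = 256 are only illustrative, the helper names are made up for this example, and it assumes a 1-D launch with a warp size of 32 and a block size that is a multiple of the warp size.

```python
# Pure-Python sketch of the launch arithmetic; it does not touch the GPU.
# Assumptions: 1-D grid, warp size 32, block_dim is a multiple of 32.
WARP_SIZE = 32


def launch_shape(n, block_dim):
    """Return (blocks, total_threads, surplus_threads) for n elements."""
    blocks = (n + block_dim - 1) // block_dim  # round up to cover every element
    total = blocks * block_dim
    return blocks, total, total - n


def divergent_warps(n, block_dim):
    """Count warps where some threads pass `idx < n` and others fail it."""
    _, total, _ = launch_shape(n, block_dim)
    count = 0
    for first in range(0, total, WARP_SIZE):
        last = first + WARP_SIZE - 1
        if first < n <= last:  # this warp straddles the end of the array
            count += 1
    return count


print(launch_shape(1000, 256))     # (4, 1024, 24)
print(divergent_warps(1000, 256))  # 1
```

So with these numbers, 24 threads would write out of bounds without the check, and with the check only 1 warp out of 32 actually diverges. When n is an exact multiple of block_dim, both counts drop to zero.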
On Tue, Sep 27, 2011 at 8:08 AM, ericyosho <ericyosho(a)gmail.com> wrote:
I'm not sure if this is the right place to ask, but since the question
is so elementary, I would appreciate an explanation.
In every CUDA tutorial example, e.g. one that doubles each element of an
array, the kernel function contains the following lines:
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // a unique value for each thread
    if (idx < N)  // N is the number of elements of the array
        a[idx] *= 2;
An "if" branch is a rather expensive operation, so why do we want each
thread to perform this check?
Since each device is only allowed to run one kernel function at a time,
why don't we let each thread double its own associated value, and
afterwards simply copy N elements back to the host?
Basically, we just omit the "if" check, and go for the "double
It seems this approach is more straightforward.
Am I missing anything?
PyCUDA mailing list