On Sun, 13 Mar 2011 16:04:06 -0700 (PDT), elafrit <afrit.mariem(a)gmail.com> wrote:
I wonder if I can improve the PyCUDA code by editing the maximum number
of threads in gpuarray.py?
The only way to find out is to try. If you do find a way to improve the
speed, please do let the list know.
I imagine that a better approach might be to try to introduce some
instruction-level parallelism (or at least create some wiggle room for
the instruction scheduler in ptxas). That, unfortunately, is sort of
difficult.
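
To make the instruction-level parallelism idea a bit more concrete, here
is a rough sketch, not something taken from gpuarray.py. It assumes a
hand-written kernel (the names scale_ilp, ELEMS, kernel_source,
visit_counts and scale_ilp_on_gpu are all made up for illustration) in
which each thread loads several independent elements before storing any
of them, so that the loads do not depend on each other. Whether this is
actually faster is untested; a plain scaling kernel may well be limited
by memory bandwidth either way.

```python
# CUDA C source for a scaling kernel with ELEMS independent elements
# in flight per thread. ELEMS_VALUE is substituted by kernel_source().
KERNEL_TEMPLATE = """
#define ELEMS ELEMS_VALUE

__global__ void scale_ilp(float a, const float *x, float *z, int n)
{
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Main loop: ELEMS loads with no dependencies between them,
    // followed by ELEMS multiplies and stores.
    for (; i + (ELEMS - 1) * stride < n; i += ELEMS * stride) {
        float v[ELEMS];
        #pragma unroll
        for (int k = 0; k < ELEMS; ++k)
            v[k] = x[i + k * stride];
        #pragma unroll
        for (int k = 0; k < ELEMS; ++k)
            z[i + k * stride] = a * v[k];
    }

    // Tail: leftover elements, one per iteration.
    for (; i < n; i += stride)
        z[i] = a * x[i];
}
"""


def kernel_source(elems=4):
    """Return the kernel source with ELEMS set to `elems`."""
    return KERNEL_TEMPLATE.replace("ELEMS_VALUE", str(int(elems)))


def visit_counts(n, total_threads, elems):
    """Plain-Python emulation of the kernel's index pattern.

    Returns how often each of the n elements is written, so the
    indexing can be checked without a GPU.
    """
    counts = [0] * n
    stride = total_threads
    for t in range(total_threads):
        i = t
        while i + (elems - 1) * stride < n:
            for k in range(elems):
                counts[i + k * stride] += 1
            i += elems * stride
        while i < n:
            counts[i] += 1
            i += stride
    return counts


def scale_ilp_on_gpu(a, x_host, elems=4, block=256):
    """Run the kernel. Needs PyCUDA, NumPy and a CUDA-capable device."""
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context)
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    func = SourceModule(kernel_source(elems)).get_function("scale_ilp")
    x_gpu = gpuarray.to_gpu(np.asarray(x_host, dtype=np.float32))
    z_gpu = gpuarray.empty_like(x_gpu)
    n = x_gpu.size
    # Launch roughly one thread per ELEMS elements.
    grid = max(1, (n + block * elems - 1) // (block * elems))
    func(np.float32(a), x_gpu.gpudata, z_gpu.gpudata, np.int32(n),
         block=(block, 1, 1), grid=(grid, 1))
    return z_gpu.get()
```

Timing scale_ilp_on_gpu for a few values of elems against the stock
gpuarray multiplication would be the way to see whether any of this
helps on a given card.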
And I can't understand what's really happening when I use the methods of
gpuarray to multiply a matrix by a scalar. Is the scalar sent to the GPU
for each element of the matrix, or is it sent only the first time? And
is it sent as a scalar or as a gpuarray?
CPU scalars are sent as kernel parameters, which is a fairly efficient
way of broadcasting to all thread blocks.
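
As an illustration of what "sent as a kernel parameter" means, here is a
small sketch using PyCUDA's ElementwiseKernel. It is not the code inside
gpuarray.py; the names SCALE_ARGS, SCALE_OP, scalar_param_bytes and
scale_on_gpu are made up for this example. The point is that the scalar
appears as a by-value argument ("float a") next to the array pointers,
so it travels with the kernel launch once per call and no device buffer
is allocated for it.

```python
import struct

# "a" is an ordinary by-value kernel parameter; "x" and "z" are
# pointers to device memory.
SCALE_ARGS = "float a, float *x, float *z"
SCALE_OP = "z[i] = a * x[i]"


def scalar_param_bytes(a):
    """Bytes a single-precision scalar occupies as a by-value argument."""
    return struct.pack("f", a)


def scale_on_gpu(a, x_host):
    """Compute a * x_host on the GPU.

    Needs PyCUDA, NumPy and a CUDA-capable device.
    """
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context)
    import pycuda.gpuarray as gpuarray
    from pycuda.elementwise import ElementwiseKernel

    scale = ElementwiseKernel(SCALE_ARGS, SCALE_OP, "scale_by_scalar")
    x_gpu = gpuarray.to_gpu(np.asarray(x_host, dtype=np.float32))
    z_gpu = gpuarray.empty_like(x_gpu)
    # The scalar is handed over with the launch itself; every thread
    # in every block reads the same value.
    scale(np.float32(a), x_gpu, z_gpu)
    return z_gpu.get()
```

For example, scale_on_gpu(2.0, numpy.arange(8)) should return the input
doubled. Only the array is copied to the device beforehand; the scalar
costs four bytes of kernel arguments per launch.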