Forgot to cc the list. Darn. :)
---------- Forwarded Message ----------
Subject: Re: [PyCUDA] broadcasting and strided data
Date: Thursday 27 August 2009
From: Andreas Klöckner <lists(a)informa.tiker.net>
To: James Bergstra <james.bergstra(a)gmail.com>
On Tuesday 25 August 2009, you wrote:
It probably requires the expertise of a few people to get the design
right, so I'm reluctant even to try to put a patch together. First,
it requires some changes to the data container. Some of the issues
that come up are:
- what should be the strides for broadcastable dimensions (I like 0,
but numpy does it differently)
Assigning a stride of zero seems to be a good "simple" way, even though it
might waste some processor power on unneeded index math. How does numpy do
it?
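
To make the stride-zero idea concrete, here is a minimal pure-Python sketch of
how a flat element offset falls out of an index and a stride tuple. The helper
name flat_offset and the example data are made up for illustration and are not
part of PyCUDA; strides are assumed to be in element units.

```python
def flat_offset(index, strides):
    """Map a multi-dimensional index to a flat element offset."""
    return sum(i * s for i, s in zip(index, strides))


# A length-3 row vector broadcast to shape (4, 3): the broadcast axis
# gets stride 0, so every row aliases the same three stored elements.
row = [10.0, 20.0, 30.0]
broadcast_strides = (0, 1)  # (stride along rows, stride along columns)

rows = [
    [row[flat_offset((i, j), broadcast_strides)] for j in range(3)]
    for i in range(4)
]
```

The multiplication by zero is the "unneeded index math" in question: it is
redundant work, but it lets broadcast and ordinary axes share one indexing
formula.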
- should strides be in data-type units or byte units
I find this somewhat irrelevant--for the kernels themselves, data-type units
are likely more useful, especially if texturing is used. For storage, looking
like numpy by using byte offsets might be the way to go. Since doing the
conversion on the host right ahead of the kernel invocation is easy and cheap,
I don't see why we can't have our cake and eat it, too. (See also the next
point.)
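
As a rough sketch of that host-side conversion, assuming numpy-style byte
strides are what gets stored: the function name to_element_strides is
hypothetical, and the divisibility check is my own assumption about what a
kernel working in data-type units would need.

```python
def to_element_strides(byte_strides, itemsize):
    """Convert numpy-style byte strides to data-type (element) units."""
    for s in byte_strides:
        if s % itemsize != 0:
            raise ValueError(
                "stride %d is not a multiple of itemsize %d" % (s, itemsize)
            )
    return tuple(s // itemsize for s in byte_strides)


# A C-contiguous float32 array of shape (4, 3) has byte strides (12, 4).
elem_strides = to_element_strides((12, 4), itemsize=4)  # (3, 1)
```

This is a handful of integer divisions per launch, so doing it right before
the kernel call should cost next to nothing.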
- should strides and dimensions be stored in host memory, device
memory, or both (how/when should they be synchronized?)
Host memory seems to be the right place: kernel parameters, which originate
there, are the only easy way to hand a value to every thread without
incurring a global memory access penalty.
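
One way this could look on the host side, purely as a sketch: the class
HostLayout and its kernel_args method are hypothetical names, not existing
PyCUDA API, and the argument order (shape first, then strides) is an arbitrary
choice for the example.

```python
class HostLayout:
    """Host-side record of an array's shape and element strides."""

    def __init__(self, shape, elem_strides):
        if len(shape) != len(elem_strides):
            raise ValueError("shape and strides must have the same length")
        self.shape = tuple(shape)
        self.elem_strides = tuple(elem_strides)

    def kernel_args(self):
        """Return shape followed by strides as one flat tuple of ints,
        to be passed by value alongside the device pointer."""
        return self.shape + self.elem_strides


layout = HostLayout(shape=(4, 3), elem_strides=(0, 1))
args = layout.kernel_args()  # (4, 3, 0, 1)
```

Nothing here lives in device memory, so there is nothing to synchronize; an
actual launch would still have to convert these to sized integer types.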
As the data structure gets more complicated, the kernels become more
complex too. My experience is that all kernels have to have a
"general" version that is pretty slow, and progressively, more and
more special cases get optimized.
I find it helpful to do things the other way around. Solve a rather special
case first, then generalize. Even incremental solutions are valuable.
Kernel code generators get bloated.
Deciding on the right complexity for the generators is definitely an issue.
Rome wasn't built in a day. Going about this incrementally and not rushing it
seems like a wise idea. You're not on your own.
How many kinds of kernels are there in PyCUDA right now? (Given
the same code-generator can produce many elementwise kernels, I mean
to count that as one *kind* of kernel.) How many things would break
if arrays were strided?
Two. Elementwise kernels and reduction kernels are the kinds currently
implemented. All of the GpuArray functionality is written in terms of these