On Tue, Apr 5, 2011 at 1:27 PM, Andreas Kloeckner wrote:
In principle, I'm open to the idea of a CUDA/CL abstraction layer. I
don't think it would have a measurable performance impact, but we
*should* back up that claim if we decide to make one.
What kept me from making one is that the two sets of launch logic are
sufficiently non-alike (CUDA dealing with cache-size bits, CL adapting
to Apple specifics, CL dealing with wait_fors (it doesn't yet, but it
needs to)...) that I thought having two different invocation codes
might not be such a horrible idea. Perhaps the right recipe would be to
try and gradually move the two codes towards using a common set of
functions, before trying to go whole-hog and unify them entirely.
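As a rough illustration of that recipe (all names and signatures here are hypothetical, not taken from PyCUDA or PyOpenCL): each backend keeps its own launch entry point for its API-specific concerns, while both route through a shared set of helpers that can grow over time.

```python
# Hedged sketch: one API-specific launch path per backend, factored
# around shared helpers rather than unified outright.

def _prepare_args(args):
    # Shared helper: normalize the argument list (a real version would
    # pack scalars and translate buffer objects for the backend).
    return tuple(args)

def launch_cuda(kernel_name, grid, block, *args):
    # CUDA-specific concerns (e.g. cache-size configuration) would live here.
    return ("cuda", kernel_name, grid, block, _prepare_args(args))

def launch_cl(kernel_name, global_size, local_size, *args, wait_for=None):
    # CL-specific concerns (Apple quirks, wait_for event lists) would live here.
    return ("cl", kernel_name, global_size, local_size,
            _prepare_args(args), wait_for)
```

The point is only that the common-function set can grow incrementally until the two entry points become thin enough to merge.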
I also noticed that you have two sets of launch logic in PyFFT--what was
your reasoning behind that?
Pyfft has two thin wrappers for CUDA/CL contexts, modules and kernels,
just to avoid writing constructions like this one everywhere:
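(The snippet referred to here is not reproduced in the thread. As a hypothetical illustration, the kind of construction such a wrapper avoids is an API branch at every call site; the wrapper moves that branch to one place.)

```python
# Minimal sketch of a thin kernel wrapper hiding the CUDA/CL branch.
# Everything here is illustrative, not pyfft's actual code.

class Kernel:
    def __init__(self, api, raw_kernel):
        self._api = api          # "cuda" or "cl"
        self._raw = raw_kernel   # backend kernel object (stubbed here)

    def __call__(self, *args):
        # Call sites stay identical; the branch lives here, once.
        if self._api == "cuda":
            return ("cuda-launch", args)  # pycuda-style invocation would go here
        else:
            return ("cl-launch", args)    # pyopencl-style invocation would go here
```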
Of course, in pyfft's case all kernel invocations are quite similar,
which allowed me to create the simple abstraction. I guess it will not
be that easy in the general case.
That might not be such a bad idea. It makes us rely more heavily on
Python's dependency handling, but I suppose that's ok. I've registered a
PyPI project for this:
(fairly empty) git tree at:
(I'm shooting for, e.g., 'compyte.primitives.scan'. :)
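(For context on that module path: scan is the prefix-sum primitive. A minimal serial reference of its semantics, in plain Python rather than anything from compyte, is:)

```python
from itertools import accumulate

def inclusive_scan(seq, op=lambda a, b: a + b):
    # Serial reference semantics of scan: output element i is the running
    # reduction of inputs 0..i. A GPU implementation parallelizes this.
    return list(accumulate(seq, op))
```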
I'll probably release the 2011.1 versions of PyCUDA and PyOpenCL without
this, but we can look at unifying this afterwards.
Yes, I think I'll try to implement an RNG in compyte and see how this
goes. By the way, it would be nice to have this repo mirrored on
GitHub; I think it provides quite a convenient interface for