lists at informa.tiker.net
Fri Nov 20 07:08:36 PST 2009
On Friday 20 November 2009, James Bergstra wrote:
> Now that we're taking more advantage of PyCUDA's and CodePy's ability
> to generate really precise special-case code... I'm finding that we
> wind up with a lot of ambiguities about *which* generator should
> handle a given special case. The right choice for a particular input
> structure is platform-dependent--a function of cache sizes, access
> latencies, transfer bandwidth, register counts, number of processors,
> etc, etc. The wrong choice can carry a big performance penalty.
> FFTW and ATLAS get around this by self-tuning algorithms, which I
> don't understand in detail, but which generally work by trying a lot
> of generators on a lot of special cases, and then using the database
> of timings to make good choices quickly at runtime.
What has worked well for me is to try a big bunch of kernels right before
their intended use and to cache which one was fast for that special case only.
The main delay is the compilation of all these kernels; the trial runs
themselves are very quick, thanks to the GPU. There's just enough caching at
each level to make this efficient.
> It seems like this automatic-tuning is even more important for GPU
> implementations than for CPU ones.
That certainly echoes one claim from the PyCUDA article. :)
> Are there libraries to help with this?
First of all, since it's a thorny (and unsolved) problem, PyCUDA doesn't try
to get involved in it. Supporting it--yes, getting involved--no. That said,
I'm not aware of libraries that make autotuning significantly easier. Nicolas
mentioned that he's eyeing some machine learning techniques like the ones in
Milepost gcc. Nicolas, care to comment? Aside from that, Cray's "grouped,
attributed orthogonal search" sounds useful.