[PyCUDA] autotuning

James Bergstra james.bergstra at gmail.com
Fri Nov 20 06:20:11 PST 2009


Now that we're taking more advantage of PyCUDA's and CodePy's ability
to generate really precise special-case code, I'm finding that we
wind up with a lot of ambiguity about *which* generator should
handle a given special case.  The right choice for a particular input
structure is platform-dependent: it is a function of cache sizes, access
latencies, transfer bandwidth, register counts, number of processors,
and so on.  The wrong choice can carry a big performance penalty.

FFTW and ATLAS get around this with self-tuning algorithms, which I
don't understand in detail, but which generally work by trying a lot
of generators on a lot of special cases, and then using the resulting
database of timings to make good choices quickly at runtime.
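
To make the idea concrete, here is a rough sketch of the "time
everything, then look it up" scheme in plain Python.  It is not how
FFTW or ATLAS actually do it.  All the names (autotune, dispatch,
CANDIDATES, make_args) are made up for illustration, and two CPU
summation routines stand in for generated GPU kernels.  On a GPU the
wall-clock timing below would presumably be replaced by timing with
CUDA events.

```python
import timeit


# Two hypothetical "generators" for the same operation.  In real use these
# would be different generated kernels; here they are plain CPU functions.
def sum_builtin(xs):
    return sum(xs)


def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


CANDIDATES = {"builtin": sum_builtin, "loop": sum_loop}


def make_args(n):
    """Build representative input for the special case 'list of length n'."""
    return (list(range(n)),)


def autotune(candidates, make_args, cases, repeat=3):
    """Time every candidate on every case; return {case: fastest name}."""
    db = {}
    for case in cases:
        args = make_args(case)
        timings = {}
        for name, fn in candidates.items():
            # Take the minimum over a few runs to reduce scheduling noise.
            runs = timeit.repeat(lambda: fn(*args), number=1, repeat=repeat)
            timings[name] = min(runs)
        db[case] = min(timings, key=timings.get)
    return db


def dispatch(db, candidates, case, *args):
    """Run-time lookup: use the tuned choice, else the first candidate."""
    name = db.get(case, next(iter(candidates)))
    return candidates[name](*args)


if __name__ == "__main__":
    db = autotune(CANDIDATES, make_args, cases=[10, 1000, 100000])
    print(db)  # which candidate wins depends on the machine
    print(dispatch(db, CANDIDATES, 1000, list(range(1000))))
```

The database here is just a dict keyed by input size.  A real tuner
would need a richer key (shapes, dtypes, device) and would have to
save the timings per platform, since the whole point is that the best
choice differs from one machine to the next.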

It seems like this kind of automatic tuning is even more important for
GPU implementations than for CPU ones.  Are there libraries to help
with this?

James
-- 
http://www-etud.iro.umontreal.ca/~bergstrj



