james.bergstra at gmail.com
Fri Nov 20 06:20:11 PST 2009
Now that we're taking more advantage of PyCUDA's and CodePy's ability
to generate really precise special-case code, I'm finding that we
wind up with a lot of ambiguity about *which* generator should
handle a given special case. The right choice for a particular input
structure is platform-dependent: a function of cache sizes, access
latencies, transfer bandwidth, register counts, number of processors,
and so on. The wrong choice can carry a big performance penalty.
FFTW and ATLAS get around this with self-tuning algorithms, which I
don't understand in detail, but which generally work by trying a lot
of generators on a lot of special cases, and then using the resulting
database of timings to make good choices quickly at runtime.
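The try-everything-and-time-it idea can be sketched in a few lines of
Python. This is a toy illustration of the FFTW/ATLAS approach, not any
real library's API: the function names (autotune, sum_loop,
sum_builtin) are hypothetical, and a real tuner would key its timing
database by input structure and platform rather than retiming on every
call.

```python
import time

def autotune(candidates, args, trials=3):
    """Time each candidate implementation on the given args and
    return (name of fastest, dict of best timings).  Hypothetical
    sketch: run each candidate a few times, keep the minimum wall
    time, and pick the winner."""
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(trials):
            t0 = time.perf_counter()
            fn(*args)
            best = min(best, time.perf_counter() - t0)
        timings[name] = best
    return min(timings, key=timings.get), timings

# Two interchangeable "generated" kernels for the same special case:
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

best, timings = autotune(
    {"loop": sum_loop, "builtin": sum_builtin},
    ([1] * 100000,),
)
```

A production tuner would persist `timings` (keyed by problem size and
device) so the search cost is paid once, which is essentially what
FFTW's "wisdom" mechanism does.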
It seems like this automatic tuning is even more important for GPU
implementations than for CPU ones. Are there libraries to help with this?