If you have been wondering why the matrix-multiply example shipped with
PyOpenCL shows sub-standard performance on Nvidia hardware, wonder no
longer. In anticipation of next week's SciPy conference, I've finally
fixed that, and it turned out to be (d'oh!) bank conflicts. Which is
odd, since the example was (at some point) derived from Nvidia's own SDK
example. Anyway, for me, matmul performance on the same hardware is now
comparable between CL and CUDA.
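
For anyone unfamiliar with bank conflicts, here is a toy model in plain Python. It is only a sketch and not the actual change made to the example: the bank count (16), the tile size (16), and the helper name `max_conflict_degree` are all assumptions chosen for illustration. It counts how many threads of a group land in the same local-memory bank when each thread reads one element from the same column of a tile, with and without one word of row padding.

```python
from collections import Counter

NUM_BANKS = 16  # assumed number of local-memory banks (hardware dependent)
TILE = 16       # assumed tile width, one thread per tile row

def max_conflict_degree(row_stride, column=0, threads=TILE, num_banks=NUM_BANKS):
    """Return the largest number of threads that hit a single bank.

    Thread t reads word address t * row_stride + column, and a word
    address maps to bank (address % num_banks) in this simplified model.
    """
    banks = Counter((t * row_stride + column) % num_banks for t in range(threads))
    return max(banks.values())

# Unpadded rows: every thread in the column access maps to the same bank.
print(max_conflict_degree(TILE))
# Rows padded by one word: the accesses spread over distinct banks.
print(max_conflict_degree(TILE + 1))
```

In this model a result of 1 means the access is conflict-free, while a larger number means that many accesses to one bank get serialized. Padding the row stride by one word is a common way to break up the pattern, though whether it applies depends on the kernel's actual access pattern and the device.
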
Just thought I'd let you know.