You may indeed get better results with the Intel OpenCL CPU
implementation - I got a 3.5x speedup compared to AMD's implementation,
though I suspect it could be due to a more optimized implementation of
the sin/cos functions, which would not help in your case.
As for the Intel OpenCL implementation not showing up as a platform, I
don't think the PyOpenCL siteconfig matters - do you have the Intel
OpenCL library declared in /etc/OpenCL/vendors/ ?
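
A quick way to check what is declared there is something like the sketch
below. It assumes the usual Linux ICD layout, where each .icd file in
that directory holds the name or path of a vendor library; the function
name list_icd_files is just something I made up for illustration.

```python
from pathlib import Path


def list_icd_files(vendors_dir="/etc/OpenCL/vendors"):
    """Return {icd filename: library named inside it} for an ICD directory.

    Assumes the common Linux layout: one *.icd text file per vendor,
    each containing the name or path of that vendor's OpenCL library.
    """
    directory = Path(vendors_dir)
    if not directory.is_dir():
        # No ICD directory at all: no vendor can be declared this way.
        return {}
    entries = {}
    for icd in sorted(directory.glob("*.icd")):
        entries[icd.name] = icd.read_text().strip()
    return entries


if __name__ == "__main__":
    for name, library in list_icd_files().items():
        print(name, "->", library)
```

If no Intel entry shows up in that listing, the platform will most likely
be missing from pyopencl.get_platforms() as well, whatever the siteconfig
says.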
But as was said previously, your code really is limited by memory
transfers, so you can't expect much acceleration (and on the GPU you
really need coalesced memory transfers - like transferring blocks of
16x16 points and then working on them).
Also, that "if" in the middle of your kernel is probably going to kill
parallelism, at least on the GPU (the compiler can't assume parallel
threads will take the same branch, so execution would be serialized?).
The way it's written in Python is much more friendly for parallelization, in