David Garcia wrote:
These two restrictions, put together, mean that there's a significant
overhead associated with doing any brief computation on the GPU. You
need to compare the time spent transferring data from the CPU's RAM
into the GPU's RAM with the time the computation itself is going to
take. If all you are doing is a component-wise vector addition, the
cost of moving the data around is going to be greater than the cost of
the actual ALU instructions, which is why you are seeing some
disappointing performance.
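The cost comparison above can be sketched with a rough back-of-envelope model. All bandwidth figures below are assumed round numbers for illustration (a PCIe-era host-to-device link and a mid-range GPU's memory bus), not measurements of any particular device:

```python
# Rough cost model for an N-element float32 vector add c = a + b on a GPU.
# Three buffers cross the host<->device link, but the kernel itself is
# memory-bound: one add per element, limited by GPU RAM bandwidth.
N = 10_000_000
bytes_per_float = 4
pcie_bw = 8e9       # assumed host<->device bandwidth, bytes/s
gpu_mem_bw = 150e9  # assumed GPU memory bandwidth, bytes/s

transfer_bytes = 3 * N * bytes_per_float      # a and b in, c out
transfer_time = transfer_bytes / pcie_bw      # time spent on PCIe

# The kernel reads a and b and writes c from/to GPU RAM.
kernel_time = transfer_bytes / gpu_mem_bw

print(f"transfer: {transfer_time * 1e3:.1f} ms")
print(f"kernel:   {kernel_time * 1e3:.1f} ms")
print(f"ratio:    {transfer_time / kernel_time:.1f}x")
```

With these assumed numbers the transfers dominate by more than an order of magnitude, which is why a lone vector-add benchmark mostly measures the bus, not the ALUs.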
David, I'm aware of the issues you mention, and I wasn't disappointed
by the timings. I simply took the benchmark case distributed with
pyopencl as given; I didn't cook it up myself. Your comments seem to
imply that another benchmark case may be more informative.
thanks for your feedback,