The raw computing power of any modern GPU is vastly superior to that of a modern CPU. However, keep in mind that a GPU runs asynchronously from the CPU that is running the Python interpreter, and that it has its own physically separate memory (at least on desktop PCs).

These two restrictions together mean that there is significant overhead associated with any brief computation on the GPU. You need to compare the time spent transferring data from the CPU's RAM into the GPU's RAM with the time the computation itself is going to take. If all you are doing is a component-wise vector addition, the cost of moving data around is going to be greater than the cost of the actual ALU instructions, which is why you are seeing disappointing performance.

Reduce the amount of communication and synchronization between the CPU and the GPU and then you will see what these devices can do.
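
To put rough numbers on this, here is a minimal back-of-envelope sketch. The bandwidth and launch-overhead constants are illustrative assumptions of mine, not measurements from any particular card, and the function names are made up for this example. It estimates what share of the total time goes into bus transfers when two float32 vectors are uploaded once, some number of element-wise kernels run on them, and one result is downloaded.

```python
# Back-of-envelope model: host<->device transfer time vs. time spent in kernels.
# All hardware figures are illustrative assumptions, not measurements.

PCIE_BYTES_PER_S = 4e9       # assumed effective host<->device bandwidth
DEVICE_BYTES_PER_S = 100e9   # assumed GPU memory bandwidth
LAUNCH_OVERHEAD_S = 20e-6    # assumed fixed cost per kernel launch


def transfer_time(n_elements, n_arrays=3, itemsize=4):
    """Seconds to move n_arrays vectors across the bus (2 inputs up, 1 result down)."""
    return n_arrays * n_elements * itemsize / PCIE_BYTES_PER_S


def kernel_time(n_elements, n_arrays=3, itemsize=4):
    """Seconds for one element-wise kernel that reads two vectors and writes one.

    Element-wise kernels are modelled as limited by device memory bandwidth.
    """
    return LAUNCH_OVERHEAD_S + n_arrays * n_elements * itemsize / DEVICE_BYTES_PER_S


def transfer_fraction(n_elements, n_kernels):
    """Share of total time spent on transfers when n_kernels kernels
    run on data that is uploaded once and downloaded once."""
    t_copy = transfer_time(n_elements)
    t_gpu = n_kernels * kernel_time(n_elements)
    return t_copy / (t_copy + t_gpu)


n = 10_000_000
for kernels in (1, 10, 100, 1000):
    share = transfer_fraction(n, kernels)
    print(f"{kernels:>5} kernels per transfer: {share:.1%} of time in transfers")
```

Under these assumed figures a single vector addition spends nearly all of its time on the bus, and the transfer share only becomes small once many kernels reuse data that is already on the device.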



On Sat, Feb 13, 2010 at 8:41 AM, Sven Schreiber <> wrote:
Sven Schreiber wrote:
> Andrew Straw wrote:
>> Sven Schreiber wrote:

>>> Yes, absolutely! So really my question was meant as:
>>> Ubuntu 9.10 and Nvidia SDK howto?
>> The examples are working for me on Karmic with the attached,
>> but I haven't gone any further. I'm using amd64 arch, the 195.30 beta
>> drivers, and a GeForce GTX 260.
> Thanks, I will keep the possibility in mind to upgrade to the beta
> drivers. However, searching for "nvidia sdk gcc 4.4" I found some
> instructions how to get the cuda sdk up and running on ubuntu 9.10. I'll
> try these soon and probably report back here to leave some hints for
> future readers with the same problem.

Ok, so I have pyopencl-0.91.4 as well as pycuda-0.93 now up and running.

The combination is (short version):
* Driver 195.30 beta
* gcc symlink pointing to gcc-4.4 for compiling the driver kernel
module, but pointing to gcc-4.3 for the rest
* (for the Cuda stuff, following the advice in; and for
Nvidia's OpenCL examples I also changed the CXX, CC, and LINK lines in

I had problems with the 190.29 drivers, and while pycuda worked with the
190.53 drivers, (py)opencl didn't -- I guess the latter is expected. So
for me indeed only the 195.30 beta drivers seem to work with both.

BTW, a remark about the example file. I think the speed
comparison there is a little biased in favor of pyopencl. It compares
(almost) pure Python with pyopencl, but IMHO the more meaningful
comparison would be between Numpy vectorized code and pyopencl. AFAICS
the numpy equivalent of the pure Python code would be:

for j in range(1000):  # number of iterations, just for comparability
    n_result = (a+b)**2 * (a/2.0)

At least the results seem to agree when checked afterwards. On my test
system I get the following timings:
* pure Python: 20.85s
* vectorized Numpy on CPU: 0.044s
* pyopencl on GPU: 0.034s

Of course I'm *not* saying that the pyopencl approach isn't fast and
useful. (My test graphics card is very low end and is on the slow PCI
bus.) But the first one or two orders of magnitude can be achieved
already without any GPU magic.
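
For reference, the speedup factors implied by the timings above can be worked out with a few lines of plain Python. This is only arithmetic on the three numbers quoted; the dictionary keys and the `speedup` helper are names chosen for this sketch.

```python
import math

# Timings in seconds, copied from the list above.
timings = {
    "pure Python": 20.85,
    "vectorized Numpy on CPU": 0.044,
    "pyopencl on GPU": 0.034,
}


def speedup(slow, fast):
    """How many times faster the `fast` variant ran than the `slow` one."""
    return timings[slow] / timings[fast]


numpy_vs_python = speedup("pure Python", "vectorized Numpy on CPU")
gpu_vs_numpy = speedup("vectorized Numpy on CPU", "pyopencl on GPU")

print(f"Numpy vs pure Python: {numpy_vs_python:.0f}x "
      f"({math.log10(numpy_vs_python):.1f} orders of magnitude)")
print(f"pyopencl vs Numpy:    {gpu_vs_numpy:.2f}x")
```

With the figures quoted, this works out to roughly 470x for Numpy over pure Python and about 1.3x for pyopencl over Numpy.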

thank you for these very cool tools,

PyOpenCL mailing list