This may be a stupid question, but has anyone got any experience in
implementing a parallel prefix sum with pycuda? I am trying to replace a
portion of numpy code I had which was used to calculate the root mean square
difference between two images. There are a few steps in this procedure: a new
array is created from the pairwise difference between the two source arrays.
Each element of this new array is squared, and finally the elements of the
new array are summed and divided by the array length in order to calculate
the average square error between the two source images. My numpy code looks
like this (hopefully the indentation is intact):
import math
import numpy

def rmse(first, second):
    """ Returns a similarity score based on the root mean square
    error between two image arrays. """
    assert numpy.size(first) == numpy.size(second)
    # Stack the two images depth-wise; diff along the new axis gives
    # the pairwise difference between corresponding pixels.
    difference = numpy.diff(numpy.dstack((first, second)), axis=2)
    return 1 - math.sqrt(numpy.mean(difference**2)) / 255.0
It's not hard to see how pycuda could be used to accomplish the first few
steps of calculating the squared pairwise difference between two source
arrays: I was thinking that I would upcast both image arrays to floats,
since I would be unable to accurately square or subtract them from each
other as 8-bit unsigned integers. Off the top of my head, the kernel might
start looking like this:
mod = SourceModule("""
__global__ void diff_and_square(float *dest, float *a, float *b)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    dest[i] = a[i]*a[i] + b[i]*b[i] - (2 * a[i] * b[i]);  /* == (a[i]-b[i])^2 */
}
""")
but I'm stuck on the last step: summation.
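(For the summation alone, a tree reduction is conceptually simpler than a full prefix scan: each pass folds the upper half of the array into the lower half until one element remains. Here is a CPU sketch of that access pattern in plain numpy, so the indexing can be checked before porting it to a kernel; the function name and padding scheme are mine, purely for illustration. Each `while` iteration corresponds to one kernel launch, or one synchronized round within a block.)

```python
import numpy

def tree_reduce_sum(values):
    """Sum an array using the pairwise tree pattern a GPU kernel would use.

    Each pass adds the upper half of the working array into the lower
    half, halving the number of active elements.
    """
    data = numpy.asarray(values, dtype=numpy.float64).copy()
    n = len(data)
    # Pad to the next power of two so the halving always lines up.
    size = 1
    while size < n:
        size *= 2
    data = numpy.concatenate([data, numpy.zeros(size - n)])
    stride = size // 2
    while stride >= 1:
        data[:stride] += data[stride:2 * stride]
        stride //= 2
    return data[0]
```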
I've read through this page on implementing parallel prefix sum with CUDA:
It's a pretty good resource for understanding how to implement a prefix scan
while avoiding bank conflicts and remaining work-efficient. Unfortunately,
the source code offered for download on that page is no longer hosted; it
seems that it's been replaced with similar functionality in the CUDA Data
Parallel Primitives (CUDPP) library. I was trying to skim through this code
to see if it might be possible to plug it in, and it seems quite complicated,
using a plan manager class to control the behaviour of the prefix scan.
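(For what it's worth, the work-efficient Blelloch scan that page describes is easy to prototype on the CPU before worrying about bank conflicts; here is a Python sketch with my own naming, assuming a power-of-two input length as the in-block GPU version typically does.)

```python
def exclusive_scan(values):
    """Work-efficient exclusive prefix sum (Blelloch scan), CPU sketch.

    Assumes len(values) is a power of two; pad the input otherwise.
    """
    data = list(values)
    n = len(data)
    assert n & (n - 1) == 0, "length must be a power of two"
    # Up-sweep (reduce) phase: build partial sums in a binary tree.
    d = 1
    while d < n:
        for i in range(0, n, 2 * d):
            data[i + 2 * d - 1] += data[i + d - 1]
        d *= 2
    # Down-sweep phase: clear the root, then push sums back down,
    # swapping and accumulating at each level of the tree.
    data[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(0, n, 2 * d):
            t = data[i + d - 1]
            data[i + d - 1] = data[i + 2 * d - 1]
            data[i + 2 * d - 1] += t
        d //= 2
    return data
```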
I was hoping someone might be able to offer some advice on the most
painless way to accomplish this with pycuda. Thanks in advance!
Has anyone tried using pycuda with a library of GPU code such as CUDPP
or chag::pp? How do you use these in pycuda? Is it just a matter
of passing some header and lib file directories to nvcc via the
options=... keyword for SourceModule?
If you are attending Nvidia's GPU Technology Conference next week, there are
two things I'd like to point out:
- I'll be giving a talk about PyCUDA on Friday, October 2 at 2pm, where I'll
both introduce PyCUDA and talk about some exciting new developments. The talk
will be 50 minutes in length, and I'd be happy to see you there.
- Also, I'd like to propose a PyCUDA meetup on Thursday, October 1 at noon
(i.e. lunchtime). I'll be hanging out by the "Terrace" seminar room around that
time. I'm looking forward to meeting some of you in person.
See you next week,
After suffering through the linkage problems in the CUDA SDK and the
MacPorts Boost, and fixing that stuff with help from this mailing list,
I've run into another roadblock. When I run test_driver.py I get this:
$ python2.5 test_driver.py
Traceback (most recent call last):
File "test_driver.py", line 25, in <module>
assert isinstance(pycuda.autoinit.device.get_attributes(), dict)
macosx-10.5-i386.egg/pycuda/driver.py", line 51, in
for att in dir(device_attribute)
macosx-10.5-i386.egg/pycuda/driver.py", line 52, in <genexpr>
pycuda._driver.LogicError: cuDeviceGetAttribute failed: not found
I have verified that the CUDA installation is working, because the
NVIDIA sample apps work. The NVIDIA deviceQuery program gives this
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GeForce 9400M"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 1
Total amount of global memory: 266010624 bytes
Number of multiprocessors: 2
Number of cores: 16
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 0.00 GHz
Concurrent copy and execution: No
Run time limit on kernels: Yes
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple
host threads can use this device simultaneously)
Press ENTER to exit...
One potential problem is that I installed the latest CUDA package,
2.3. Does PyCUDA work with 2.3? The documentation implies that it will.
Should I downgrade to CUDA Driver 2.2? How do I uninstall the 2.3 package?
I'm not sure what else to try at this point. Any ideas?
We're generating a lot of data using our PyCuda simulations, and we've
gotten to the point where analysis of the simulation output takes far more
time than generating said output. I was thinking of collecting PyCuda
accelerated statistics / spectral analysis functions into one place for
convenience. Hopefully most of these would just be a few lines long and
define operations in terms of the existing PyCuda base. I was just wondering
if anything like this already existed?
P.S.: Does PyCuda have access to the CUDA FFT implementation? I think this
was answered before on the list, and the answer is no. Is there any
GPU-accelerated FFT that is easily accessible from PyCuda?
Do the suggestions from this post earlier in the summer still remain
valid? I do a lot of 2D (array, sub-array) calcs in R and trying to
get up to speed on the interchange between numpy indexing & PyCuda
processing. TIA, V.
Date: Tue, 16 Jun 2009 16:18:42 -0400
From: Andreas Klöckner <lists(a)informa.tiker.net>
Subject: Re: [PyCUDA] Pointer arithmetic
Content-Type: text/plain; charset="iso-8859-1"
On Tuesday 16 June 2009, Andrew Wagner wrote:
> Suppose I have a column-major array stored in linear memory on the
> gpu, and want to run a kernel on one column.
The "right" way isn't quite supported yet, which would be to just write a[:,i]
and get the right view delivered. This is mostly because PyCUDA doesn't know
about strides just yet, and assumes that arrays are contiguous chunks of
memory. Of course, your particular case doesn't violate that assumption, so
you can feel free to hack just that case into GPUArray.__getitem__. (It
already deals with the 1D case.) (Or feel free to hack stride treatment into
PyCUDA--even a tiny step in that direction would be pretty cool.)
Another way is to obtain a 1D view of the 2D array (which would require you to
implement a .flat attribute mimicking numpy, also simple by copying what's
happening in __getitem__).
The last way is to just grab the pointer from ary.gpudata, increment it by
the right multiple of ary.dtype.itemsize, and run with that.
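(The byte arithmetic for that last option is easy to check on the CPU first. For a column-major, i.e. Fortran-order, 2D array, column j starts j * nrows * itemsize bytes past the base pointer; the small numpy sketch below, with a helper name of my choosing, verifies that offset against numpy's own strides before you add it to the integer value of the device pointer.)

```python
import numpy

def column_byte_offset(ary, j):
    """Byte offset of column j in a column-major (order='F') 2D array.

    On the GPU side, the same number would be added to the integer
    value of the array's device pointer to address that column.
    """
    nrows = ary.shape[0]
    return j * nrows * ary.dtype.itemsize

# Sanity check against numpy's stride bookkeeping: for a Fortran-order
# array, strides are (itemsize, nrows * itemsize).
a = numpy.asfortranarray(
    numpy.arange(12, dtype=numpy.float32).reshape(3, 4))
for j in range(a.shape[1]):
    assert column_byte_offset(a, j) == j * a.strides[1]
```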
Vince Fulco, CFA, CAIA
A posse ad esse non valet consequentia
“the possibility does not necessarily lead to materialization”
I am writing an article on PyCUDA, to be submitted to a parallel computation
journal. I am doing this in the hope of increasing awareness of PyCUDA within
the academic community. The article's focus is on run-time code generation
("RTCG"--aka "metaprogramming" if you like fancy words).
To support the argument that GPU RTCG is indeed a helpful technique, and that
PyCUDA is a tool that supports the technique well, I am including a section
where I would like to showcase research projects that have benefited from it.
The deal is pretty simple:
- You get to describe your research in a paragraph or two, plus possibly a
picture. The paragraph should establish how your project obtained a tangible
benefit from using PyCUDA and/or RTCG.
- You'll be listed as a coauthor.
If you're interested, please get in touch with me off-list, and we can work
from there. I can't wait to hear how you are making use of PyCUDA!
On Wednesday 02 September 2009, you wrote:
> We use your package PyCuda, which is very nice, for our neural simulator
> Brian (http://www.briansimulator.org). I noticed you recently released
> PyOpenCL, and I was wondering whether you would have any advice on which
> one we should choose for our project. In fact I just set up a google
> group to discuss GPU issues for our simulations and I would be very happy
> if you could give us some insight about it:
> http://groups.google.fr/group/brian-on-gpu. Thanks a lot and
> congratulations for your great work!
> Best wishes,
> Romain Brette
I put together a wiki page with the best answer I can provide, here:
As I've mentioned on the page, I deliberately put this on a Wiki--if you feel
like you can contribute, please don't hesitate to do so.
The fact that I've released PyOpenCL doesn't mean that I'm abandoning PyCUDA,
quite the contrary actually. Writing PyOpenCL was a way for me to take a close
look at the CL spec, and I liked what I saw. CUDA on the other hand is nice,
too. Both CL and CUDA have their advantages. Neither PyCUDA nor PyOpenCL is
going away--I expect there'll be a niche for both of them.