This may be a stupid question, but has anyone got any experience in
implementing a parallel prefix sum with pycuda? I am trying to replace a
portion of numpy code I had which was used to calculate the root mean square
difference between two images. There are a few steps in this procedure: a new
array is created from the pairwise difference between the two source arrays.
Each element of this new array is squared, and finally the elements of the
new array are summed and divided by the array length in order to calculate
the average square error between the two source images. My numpy code looks
like this (hopefully the indentation is intact):
def rmse(first, second):
    """ Returns the root mean square error between two image arrays. """
    assert numpy.size(first) == numpy.size(second)
    # upcast to float so the subtraction and squaring don't overflow uint8
    difference = numpy.diff(numpy.dstack((first, second)).astype(float))
    return 1 - math.sqrt(numpy.mean(difference**2)) / 255.0
It's not hard to see how pycuda could be used to accomplish the first few
steps of calculating the squared pairwise difference between two source
arrays: I was thinking that I would upcast both image arrays to floats,
since I would be unable to accurately square or subtract them from each
other as 8-bit unsigned integers. Off the top of my head, the kernel might
start looking like this:

mod = SourceModule("""
__global__ void diff_and_square(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i]*a[i] + b[i]*b[i] - (2 * a[i] * b[i]);
}
""")
but I'm stuck on the last step: summation.
I've read through this page on implementing parallel prefix sum with CUDA:
It's a pretty good resource for understanding how to implement a prefix scan
while avoiding bank conflicts and remaining work efficient. Unfortunately,
the source code offered for download on that page is no longer hosted; it
seems that it's been replaced with similar functionality in the CUDA Data
Parallel Primitives (CUDPP) library. I was trying to skim through that code to
see whether it might be possible to plug it in, and it seems quite
complicated, using a plan-manager class to control the behaviour of the prefix
scan in a very general way.
I was hoping someone might be able to offer some advice on the most
painless way to accomplish this with pycuda. Thanks in advance!
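For what it's worth, the summation the article describes is a tree reduction: about log2(n) passes, each halving the number of partial sums. Below is a plain-NumPy sketch of that access pattern, host-side only, just to make the structure concrete (`tree_reduce_sum` is a name I made up). Since PyCUDA already ships reduction kernels, those may well be the most painless route in practice.

```python
import numpy as np

def tree_reduce_sum(values):
    """Sum an array using the pairwise tree pattern a GPU reduction
    kernel uses: repeated passes with a doubling stride."""
    data = np.asarray(values, dtype=np.float64).copy()
    n = len(data)
    stride = 1
    while stride < n:
        # on the GPU, each step of this inner loop would be one thread
        for i in range(0, n - stride, 2 * stride):
            data[i] += data[i + stride]
        stride *= 2
    return data[0]

# mean squared error via the tree sum
a = np.array([0.0, 3.0, 6.0])
b = np.array([0.0, 1.0, 2.0])
mse = tree_reduce_sum((a - b) ** 2) / len(a)
```

The same pattern maps onto shared memory on the device; each pass corresponds to one `__syncthreads()` barrier.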
Forgot to cc the list. Darn. :)
---------- Forwarded Message ----------
Subject: Re: [PyCUDA] broadcasting and strided data
Date: Thursday, 27 August 2009
From: Andreas Klöckner <lists(a)informa.tiker.net>
To: James Bergstra <james.bergstra(a)gmail.com>
On Tuesday, 25 August 2009, you wrote:
> It probably requires the expertise of a few people to get the design
> right, so I'm reluctant even to try to put a patch together. First,
> it requires some changes to the data container. Some of the issues
> that come up are:
> - what should be the strides for broadcastable dimensions (I like 0,
> but numpy does it differently)
Assigning a stride zero seems to be a good "simple" way, even though it seems
like that might waste some processor power on unneeded index math. How does
numpy do it?
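As a data point: a zero stride is already expressible on the CPU side through NumPy's stride tricks, so the idea is not without precedent. In the sketch below, every "row" of the view aliases the same four numbers without any copy (an illustration only, not a GpuArray API proposal):

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

row = np.arange(4)
# shape (3, 4) view whose first-axis stride is 0 bytes:
# all three rows point at the same underlying memory
tiled = as_strided(row, shape=(3, 4), strides=(0, row.itemsize))
```

Writes to `row` show up in every row of `tiled`, which is exactly the aliasing behaviour a broadcast dimension implies.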
> - should strides be in data-type units or byte units
I find this somewhat irrelevant--for the kernels themselves, data-type units
are likely more useful, especially if texturing is used. For storage, looking
like numpy by using byte offsets might be the way to go. Since doing the
conversion on the host right ahead of the kernel invocation is easy and cheap,
I don't see why we can't have our cake and eat it, too. (See also the next
point.)
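Concretely, the host-side conversion from NumPy's byte strides to data-type units really is a one-liner (`elem_strides` is just an illustrative name):

```python
import numpy as np

a = np.zeros((3, 4), dtype=np.float32)
# numpy reports strides in bytes; a kernel indexing float pointers
# wants them in element counts instead
elem_strides = tuple(s // a.itemsize for s in a.strides)
```

For this C-contiguous float32 array the byte strides are (16, 4) and the element strides come out as (4, 1).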
> - should strides and dimensions be stored in host memory, device
> memory, or both (how/when should they be synchronized?)
Host memory seems to be the right place: kernel parameters originate there,
and they are the only way a value can easily be spread to every thread without
incurring a global memory access penalty.
> As the data structure gets more complicated, the kernels become more
> complex too. My experience is that all kernels have to have a
> "general" version that is pretty slow, and progressively, more and
> more special cases get optimized.
I find it helpful to do things the other way around. Solve a rather special
case first, then generalize. Even incremental solutions are valuable.
> Kernel code generators get bloated.
Deciding on the right complexity for the generators is definitely an issue.
Rome wasn't built in a day. Going about this incrementally and not rushing it
seems like a wise idea. You're not on your own.
> How many kinds of kernels are there in PyCUDA right now? (Given that
> the same code-generator can produce many elementwise kernels, I mean
> to count that as one *kind* of kernel.) How many things would break
> if arrays were strided?
Two. Elementwise kernels and reduction kernels are the kinds currently
implemented. All of the GpuArray functionality is written in terms of these
two.
Does PyCuda support broadcasting? How can I add a vector to the rows or
columns of a 2 or 3-dimensional GpuArray?
Related: does PyCuda support viewing of sub-regions of other
GpuArrays? Like, can I operate on just the first few rows or columns
of a matrix?
I looked a bit through the documentation, and I tried to do it... and
it looks like neither of these things works.
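For concreteness, this is the NumPy behaviour being asked about. Both operations are cheap there: the add broadcasts, and the sub-region is a no-copy view that writes through to the parent array.

```python
import numpy as np

m = np.arange(12, dtype=np.float64).reshape(3, 4)
v = np.ones(4)

# broadcasting: v is added to every row of m
rows_plus = m + v

# sub-region view: first two rows, no copy is made
top = m[:2]
top += 100.0  # writes through to m itself
```

Mirroring this on the GPU is exactly where the stride questions above come in: a broadcast add and a row-slice view are both just stride and offset manipulations.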
Thank you for providing access to this video. I was on the fence
about taking the extensive time to 'get closer to the metal', but
Nicolas did an exemplary job reinforcing why PyCUDA is THE most
efficient prototyping tool. Your and his time is much appreciated.
Vince Fulco, CFA, CAIA
A posse ad esse non valet consequentia
“the possibility does not necessarily lead to materialization”
On Tue, Aug 25, 2009 at 2:00 PM, <pycuda-request(a)tiker.net> wrote:
> Today's Topics:
> 1. SciPy'09: Advanced Tutorial on PyCUDA (Andreas Klöckner)
> Message: 1
> Date: Tue, 25 Aug 2009 12:47:05 -0400
> From: Andreas Klöckner <lists(a)informa.tiker.net>
> Subject: [PyCUDA] SciPy'09: Advanced Tutorial on PyCUDA
> To: pycuda(a)tiker.net
> Hi all,
> The SciPy'09 conference ended less than a week ago. At the invitation of
> the SciPy'09 organizers (especially Fernando Perez), Nicolas Pinto gave a
> talk in the Advanced Tutorials track on how to use PyCUDA to do GPU
> scripting. First, I would like to use this opportunity to publicly thank
> Nicolas for all the work and time he put into making this tutorial a
> reality. Second, I would like to point out the video of his session, which
> you can watch at the links below.
> Have fun,
>  http://conference.scipy.org/
>  http://conference.scipy.org/advanced_tutorials
Matthew Brett, on 2009-08-21 11:51, wrote:
> > Indeed. In the future, if OpenCL is the way to go, it may even be
> > helpful to have Numpy using OpenCL directly, as AMD provides an SDK
> > for OpenCL, and with Larrabee approaching, Intel will surely provide
> > one of its own.
> I was just in a lecture by one of the Intel people about OpenCL:
> He offered no schedule for an Intel OpenCL implementation, but said
> that they were committed to it.
> The lectures in general were effective in pointing out what a
> time-consuming effort it can be moving algorithms into the
> parallel world - including GPUs. The lecture just passed cited the
> example of a CUDA-based BLAS implementation on the GPU that was slower
> than the CPU version. Making BLAS go faster required a lot of work
> to find optimal strategies for blocking, transfer between CPU / GPU
> shared memory / GPU registers, vector sizes and so on - this on a
> specific NVIDIA architecture.
> I can imagine Numpy being useful for scripting in this
> C-and-assembler-centric world, making it easier to write automated
> testers, or even generate C code.
> Is anyone out there working on this kind of stuff? I ask only because
> there seems to be considerable interest here on the Berkeley campus.
This is exactly the sort of thing you can do with PyCUDA, which makes it
easy to generate and compile CUDA source at run time.
In particular, see the metaprogramming portion of the docs:
The metaprogramming section of the slides and source code from Nicolas
Pinto and Andreas Klöckner's *excellent* SciPy2009 tutorials is even more
thorough:
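For the archives, the core metaprogramming trick is simply this: build the CUDA C source as a Python string with the tuning parameters baked in, then hand it to SourceModule. A toy generator (the function and parameter names are mine; the SourceModule call is left as a comment so the snippet stands alone off the GPU):

```python
def make_scaled_copy_source(block_size, unroll):
    """Generate CUDA C for a kernel with a compile-time-unrolled loop."""
    body = "\n".join(
        "    dest[base + %d] = 2.0f * src[base + %d];"
        % (j * block_size, j * block_size)
        for j in range(unroll))
    template = """__global__ void scaled_copy(float *dest, const float *src)
{
    const int base = blockIdx.x * %(span)d + threadIdx.x;
%(body)s
}
"""
    return template % {"span": unroll * block_size, "body": body}

src = make_scaled_copy_source(block_size=128, unroll=4)
# on a CUDA machine: mod = pycuda.compiler.SourceModule(src)
```

Because the generator is plain Python, one can loop over candidate block sizes and unroll factors, benchmark each compiled variant, and keep the fastest, which is the auto-tuning workflow the tutorial demonstrates.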
I feel like some gpuarray methods (the ones that override __sub__, __mul__,
etc.) have non-intuitive behavior:
>>> import pycuda.autoinit
>>> from pycuda import gpuarray
>>> import numpy as np
>>> a_cpu = np.random.randn(2)
>>> a_gpu = gpuarray.to_gpu(a_cpu)
>>> # good
... print a_cpu - a_gpu.get()
[ 0. 0.]
>>> # bad results (no exception)
... print a_cpu - a_gpu
[[ 0.00797661 0.7765328 ] [ 0.79872256 0.61745578]]
>>> # exception
... print a_gpu - a_cpu
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
if other == 0:
ValueError: The truth value of an array with more than one element is
ambiguous. Use a.any() or a.all()
How can we fix this properly? Should we inherit from ndarray?
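One possible direction, sketched here with a stand-in class rather than GPUArray itself: make NumPy refuse to handle the mixed operation, so that Python falls back to our reflected method instead of numpy iterating over the GPU object. In current NumPy that opt-out is `__array_ufunc__ = None` (the older `__array_priority__` attribute plays a similar role). This is a sketch of the mechanism, not a patch:

```python
import numpy as np

class FakeGPUArray:
    # opting out of numpy's ufunc handling makes ndarray binary ops
    # return NotImplemented, so Python calls our reflected method
    __array_ufunc__ = None

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)

    def get(self):
        return self.data

    def __rsub__(self, other):  # handles ndarray - FakeGPUArray
        return FakeGPUArray(np.asarray(other) - self.data)

    def __sub__(self, other):   # handles FakeGPUArray - ndarray
        return FakeGPUArray(self.data - np.asarray(other))
```

With this in place, `ndarray - FakeGPUArray` reaches `__rsub__` and returns a well-defined object instead of silently broadcasting garbage.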
Ph.D. Candidate, Brain & Computer Sciences
Massachusetts Institute of Technology, USA
After some wrangling, I managed to build Boost and even install pycuda
on my system (dual-core MacBook Pro with a GeForce 8600M GT, Mac OS
10.5.8). I could import pycuda:
In : import pycuda
In : pycuda.VERSION
Out: (0, 93)
But when I try to import pycuda.autoinit, I get the following error:
In : import pycuda.autoinit
ImportError Traceback (most recent call last)
/Users/arokem/<ipython console> in <module>()
[...]
ImportError: Library not loaded: @rpath/libcuda.dylib
  Reason: image not found
It seemed like something was looking for the library libcuda.dylib. On my
system, this library lives under /usr/local/cuda/lib, so I copied it to
where the loader seemed to be searching.
Now there is still no problem importing pycuda, but when I try to import
pycuda.autoinit, I get the following error:
In : import pycuda.autoinit
Fatal Python error: Interpreter not initialized (version mismatch?)
After that, the ipython session closes and I get a system message that
python quit unexpectedly.
Does anyone know what this is about? I saw that others on the list had
similar problems recently. Has anyone managed to solve this issue?
Other possibly pertinent information is that I built the Boost
libraries from source and that the python I am using is EPD 4.3 (those
things might or might not be related to this problem).
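A guess from the sidelines, not a confirmed fix for this crash: rather than copying libcuda.dylib around, which can leave the process loading two different copies of the library, it may be cleaner to point the dynamic linker at CUDA's own library directory before starting Python.

```shell
# tell the Mac OS dynamic linker where CUDA's libraries live
# (path taken from the message above; adjust if yours differs)
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
```

Putting this in the shell profile keeps the system CUDA install as the single copy of the library that everything links against.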
Helen Wills Neuroscience Institute
University of California, Berkeley