first of all: nice piece of work Andreas!
Now to my problem: I installed CUDA 2.1 on my T61 laptop computer with a
Nvidia Quadro NVS 140M graphics card w/ 128MB memory.
The git version of PyCUDA (0.93 beta, I assume) installs fine, but I get some
errors when I try to run the tests. First, test_driver:
...E....Eterminate called after throwing an instance of 'cuda::error'
what(): cuMemFree failed: invalid context
Or running the gpuarray speed test:
Traceback (most recent call last):
File "undistributed/test_gpuarray_speed.py", line 83, in <module>
File "undistributed/test_gpuarray_speed.py", line 27, in main
b = gpuarray.zeros((size,), dtype=numpy.float32)
line 409, in zeros
result = GPUArray(shape, dtype, stream, allocator)
line 75, in __init__
self.gpudata = self.allocator(self.size * self.dtype.itemsize)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
PyCUDA WARNING: I'm being asked to destroy a
context that's part of the current context stack.
I will pick the next lower active context from the
context stack. Since this choice is happening
at an unspecified point in time, your code
may be making false assumptions about which
context is active at what point.
Call Context.pop() to avoid this warning.
If Python is terminating abnormally (eg. exiting upon an
unhandled exception), you may ignore this.
test_gpuarray only threw some warnings:
UserWarning: behavior change: arange guessed dtype other than float32.
suggest specifying explicit dtype.
warn("behavior change: arange guessed dtype other than float32. "
Ran 18 tests in 14.756s
Here is some debug information you might need:
In : pycuda.autoinit.device.get_attributes()
In : pycuda.autoinit.device.compute_capability()
Out: (1, 1)
In : pycuda.autoinit.device.total_memory()
In : pycuda.autoinit.device.make_context()
terminate called after throwing an instance of 'cuda::error'
what(): cuMemFree failed: invalid context
All the other tests were fine, so PyCUDA works nicely.
Could you please check where the "invalid context" error comes from?
And maybe add a memory check to your tests, so we low-mem GPU users aren't
discriminated against ;-).
Keep up the good work!
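For what it's worth, pycuda.driver does expose mem_get_info(), so a test could check free memory before allocating. A minimal sketch of what such a check might look like (the 64 MB figure is an arbitrary example, and this obviously needs a working CUDA context, so it is untested here):

```python
import pycuda.autoinit          # creates a context on the default device
import pycuda.driver as drv

free, total = drv.mem_get_info()   # both in bytes
needed = 64 * 1024 * 1024          # e.g. a 64 MB test buffer

if needed > free:
    print("skipping: only %d MB free of %d MB total"
          % (free // 2**20, total // 2**20))
else:
    buf = drv.mem_alloc(needed)
    buf.free()
```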
I'm sorry to bother the list, but I can't seem to generate
appropriately typed arguments for the memset_d32 function.
My current attempt looks like:
This generates the following error:
Boost.Python.ArgumentError: Python argument types in
pycuda._driver.memset_d32(numpy.uint32, numpy.uint32, numpy.uint32)
did not match C++ signature:
memset_d32(unsigned int dest, unsigned int data, unsigned int size)
How do I cast to this particular unsigned int type?
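For what it's worth: Boost.Python does not coerce numpy scalar types such as numpy.uint32 to the plain C unsigned int in that overload, which is why no signature matches. Wrapping each argument in Python's int() before the call is the usual fix. A sketch of just the conversion (the dest value is a made-up stand-in for a device pointer, and the memset_d32 call itself is commented out since it needs a live CUDA context):

```python
import numpy as np

# numpy scalars, as in the failing call -- Boost.Python will not coerce
# these to the plain C 'unsigned int' the signature asks for
dest = np.uint32(0x10000)   # stand-in for a device pointer
data = np.uint32(0)
size = np.uint32(256)

# wrapping each argument in int() yields plain Python ints
args = (int(dest), int(data), int(size))
# drv.memset_d32(*args)  # should now match the C++ signature

print(all(type(x) is int for x in args))  # → True
```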
I'm afraid I'm of no use there. I've been Windows-free for going on 10 years.
ananth ranga wrote:
> Oh that's great, thanks a lot. I really appreciate it. I am trying to
> install PyCUDA on Windows and am kind of struggling with it. Could you
> please run me through it? I have VS 05 and 08 but not 03; is that
> On Tue, Jun 30, 2009 at 11:49 AM, Derek Anderson<public(a)kered.org> wrote:
>> Well, both matrices have to be squarish. But even for, say, 100x120*120x100,
>> I would think not. Here were my performance numbers when I wrote it:
>> (includes memory transfer times)
>> (4160×4160)*(4160×4160) = 43.0X faster than numpy
>> (4096×4096)*(4096×4096) = 34.0X
>> (3900×3900)*(3900×3900) = 47.3X
>> (2048×2048)*(2048×2048) = 28.2X
>> (1024×1024)*(1024×1024) = 58.8X
>> (512×512)*(512×512) = 24.1X
>> (256×256)*(256×256) = 6.3X
>> (128×128)*(128×128) = 1.1X
>> CPU: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping 06
>> GPU: nVidia Corporation GeForce 8800 GT (rev a2)
>> But you *might* get a modest increase (<5x) if you're keeping the matrices
>> on the card and performing the multiplications many times before you pull them
>> back to main memory. (Likely, if you're doing SVD :)
>> ananth ranga wrote:
>>> Hey, mine is also a pretty evenly sized matrix: it's (120*100). So are you
>>> suggesting that for this evenly sized small matrix I can expect a speed-up
>>> in the SVD calculation? Or do you mean it should be a larger, evenly
>>> sized matrix to get a good speed-up?
>>> On Tue, Jun 30, 2009 at 11:31 AM, Derek Anderson<public(a)kered.org> wrote:
>>>> No problem. Yes, for more evenly sized matrices it's much faster. (For >500^2
>>>> BTW, if matrix multiplication is all you're looking for, I wrote a
>>>> numpy wrapper for it a while back:
>>>> ananth ranga wrote:
>>>>> Thanks, Derek. I read a paper which suggests a speed-up of up to 60
>>>>> when the matrix size is big, and almost break-even for sizes less than (500 *
>>>>> On Tue, Jun 30, 2009 at 9:53 AM, Derek Anderson<public(a)kered.org> wrote:
>>>>>> My experience with trying to CUDA-ize SVD/NMF calculations is that they're
>>>>>> not really a good fit for CUDA. Specifically, most of your expensive
>>>>>> operations are matrix multiplications over very long and narrow matrices
>>>>>> (mxk or kxn), where m~=n (within an order of magnitude) but k<<(m|n). Even with
>>>>>> m~=2^16 (the max for CUBLAS matrices) and k<2^8, I was barely breaking even
>>>>>> with normal CPU-based BLAS libs.
>>>>>> ananth ranga wrote:
>>>>>>> Hello people,
>>>>>>> I am Ranga, a new member of the group. I have the problem of
>>>>>>> finding the SVD of a matrix of size 120*100. On a CPU, the
>>>>>>> VTK-implemented version takes about 5 ms per evaluation. So I was
>>>>>>> wondering if a PyCUDA version of it could give me a better result
>>>>>>> regarding speed.
>>>>>>> If anyone has a PyCUDA version of an SVD calculation, could you please
>>>>>>> help me out.
>>>>>>> PyCUDA mailing list
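A rough way to see Derek's point about long, narrow matrices is arithmetic intensity: FLOPs of useful work per byte moved on and off the card. A small back-of-the-envelope helper (pure Python; the sizes echo the benchmark figures above, and the numbers are approximate, not measurements):

```python
def flops_per_byte(m, k, n, itemsize=4):
    """Arithmetic intensity of C = A(m x k) @ B(k x n), single precision."""
    flops = 2.0 * m * k * n                         # one multiply-add per term
    bytes_moved = (m * k + k * n + m * n) * itemsize  # transfer A, B, and C once
    return flops / bytes_moved

# Big square case (cf. the 4096^2 benchmark entry): lots of reuse per byte
print(round(flops_per_byte(4096, 4096, 4096), 1))  # → 682.7
# Small square case (cf. the 128^2 entry, which was barely 1.1X): little reuse
print(round(flops_per_byte(128, 128, 128), 1))     # → 21.3
```

The low arithmetic intensity at small sizes is why transfer time eats the speed-up there.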
Thank you for the detailed response.
At the risk of belabouring the point: a portion of the Mersenne Twister code
contains two kernels/functions for the Box-Muller transformation calculations.
One is defined __device__, and the other, which draws on the results of the
first, is __global__. Would it be possible to re-code the first as a
__global__ (with appropriate changes internally as well) and then wrap
the two with PyCUDA, or am I missing something more obvious? This may
not be an efficient use of the device, but it could be faster than
porting. Of course, there is a larger portion of C code which accesses the
host and would need to be dealt with as well.
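For what it's worth, the __device__ function should not need recoding at all: a __device__ helper and a __global__ entry point can live in the same SourceModule, and only the __global__ one is launched from Python. A minimal sketch of the pattern (illustrative names, not the actual SDK code, and untested here since it needs a CUDA-capable machine):

```python
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
// __device__ helper: callable from other GPU code, not from Python
__device__ float box_muller_half(float u1, float u2)
{
    return sqrtf(-2.0f * logf(u1)) * cosf(6.2831853f * u2);
}

// __global__ wrapper: the only part PyCUDA can launch directly
__global__ void box_muller(float *out, const float *u1, const float *u2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = box_muller_half(u1[i], u2[i]);
}
""")

box_muller = mod.get_function("box_muller")

n = 256
u1 = np.random.uniform(1e-6, 1.0, n).astype(np.float32)
u2 = np.random.uniform(0.0, 1.0, n).astype(np.float32)
out = np.empty_like(u1)
box_muller(drv.Out(out), drv.In(u1), drv.In(u2), block=(n, 1, 1))
```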
On Mon, Jun 29, 2009 at 2:00 PM, <pycuda-request(a)tiker.net> wrote:
> Today's Topics:
> 1. Re: Wrapping SDK code... (Andreas Klöckner)
> 2. Re: OpenGL interop woes... help! (Andreas Klöckner)
> Message: 1
> Date: Sun, 28 Jun 2009 15:51:28 -0400
> From: Andreas Klöckner <lists(a)informa.tiker.net>
> Subject: Re: [PyCUDA] Wrapping SDK code...
> To: pycuda(a)tiker.net
> Message-ID: <200906281551.29553.lists(a)informa.tiker.net>
> Content-Type: text/plain; charset="windows-1252"
> On Friday, June 26, 2009, Vince Fulco wrote:
>> I'm making early attempts to port over the Monte Carlo option pricing code
>> supplied with the SDK, and need to mod it for simple time-series
>> bootstrapping. Not being terribly facile in C/C++ (but learning!),
>> could someone provide a short list of the critical components which
>> need to be wrapped by pycuda?
> What PyCUDA can do for you is compile and execute functions marked __global__
> in that sample's source code--i.e. code that runs on the GPU. Everything else
> is CPU code, and making that accessible is beyond the scope of PyCUDA. If you
> do want to leave that CPU code in C, there are several other packages that
> might help you, ranging from Swig, Cython, Boost Python (potentially with
> codepy), to ctypes.
> I'm guessing that you might have the most fun if you just port the CPU control
> code to Python, though--less hassle.
>> I am aware of the various
>> kernels/functions necessary from the main body of code, but am more
>> interested in a how-to in terms of referencing the ancillary functions
>> properly, i.e. the RNGs "MonteCarlo_SM10" and "MonteCarlo_SM13".
> See above--if you want to keep those in C, use one of the packages mentioned
> above (and worry about compiling them separately), or just quickly translate
> them to Python. (you'll find they get a fair bit shorter :P)
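As a tiny illustration of the ctypes end of that spectrum (calling already-compiled C from Python with no wrapper generation at all), here the C math library's erf stands in for one's own compiled CPU helpers:

```python
import ctypes
import ctypes.util

# locate and load the C math library
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# declare the C signature: double erf(double)
libm.erf.restype = ctypes.c_double
libm.erf.argtypes = [ctypes.c_double]

print(round(libm.erf(1.0), 4))  # → 0.8427
```

The same pattern applies to a shared library built from the SDK's CPU-side code, at the cost of compiling it separately.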
I am Ranga, a new member of the group. I have the problem of
finding the SVD of a matrix of size 120*100. On a CPU, the
VTK-implemented version takes about 5 ms per evaluation. So I was
wondering if a PyCUDA version of it could give me a better result
regarding speed.
If anyone has a PyCUDA version of an SVD calculation, could you please help me out?
A few months ago I was trying to port over Nvidia's postprocessGL program,
as included with the CUDA SDK.
I am attempting to do this using the PyOpenGL bindings for Python, together
with PyCUDA. I'm a beginner at Python, I know even less about OpenGL, and less
still about CUDA, so I was mostly undertaking this as a learning exercise;
after getting frustrated at not knowing at all what to do, I gave up for a
few months, and I'm only now attempting again to make some sense of this.
I've put up a copy of my attempt so far over here:
Apologies for the very unpythonic way things are laid out; hopefully once I
have some sense of what I'm doing, I'll tidy it up.
Right now I have a few problems that I'm hoping someone can clear up for me.
The first issue is to do with the pycuda.BufferObject class. I assume this
is for representing OpenGL pixel buffer objects, but I have no idea how to
give this PBO to the CUDA kernel! The CUDA kernel wants either a
DeviceAllocation, a GPUArray, or something that implements the Python buffer
interface... It seems like the pycuda.BufferObject class is none of these
things, so how do I use it? The original C version of postprocessGL just
passes a pointer to the PBO to its kernel function.
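On the first issue: as far as I can tell, the buffer object is not passed to the kernel directly; it is mapped into CUDA's address space first, and the mapping then yields a raw device pointer the kernel can take. A sketch of the pattern (untested here; pbo, kernel, block_size and grid_size are placeholders from your own code, and the exact module layout may differ between PyCUDA versions):

```python
import pycuda.gl as cuda_gl

# pbo is the GLuint id of an already-created OpenGL pixel buffer object
buf = cuda_gl.BufferObject(int(pbo))

mapping = buf.map()                  # make the PBO visible to CUDA
kernel(mapping.device_ptr(),         # raw device pointer, usable like
       block=(block_size, 1, 1),     # a DeviceAllocation argument
       grid=(grid_size, 1))
mapping.unmap()                      # hand the buffer back to OpenGL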
The second problem may be out of the scope of this mailing list, since it's
more to do with PyOpenGL than PyCUDA per se, but if I run my program, it
fails during the glReadPixels function call, with this OpenGL error:
err = 1282,
description = 'invalid operation',
baseOperation = glReadPixels,
cArguments = (
I've read the documentation, and this error should only come up under a few
circumstances, none of which seems to be true as far as I can see
(invalid_operation is usually raised when glReadPixels would cause more data
to be written than the bound buffer can hold). Is there some other way of
setting up PBOs that I haven't figured out, or is anything in my code wrong?
If anyone can take a look at my source and offer any advice, I'd really
appreciate it.
I'm making early attempts to port over the Monte Carlo option pricing code
supplied with the SDK, and need to mod it for simple time-series
bootstrapping. Not being terribly facile in C/C++ (but learning!),
could someone provide a short list of the critical components which
need to be wrapped by PyCUDA? I am aware of the various
kernels/functions necessary from the main body of code, but am more
interested in a how-to in terms of referencing the ancillary functions
properly, i.e. the RNGs "MonteCarlo_SM10" and "MonteCarlo_SM13".
Vince Fulco, CFA, CAIA
A posse ad esse non valet consequentia
“the possibility does not necessarily lead to materialization”
Hello all -
I'm trying to install PyCUDA on OS X 10.5, using Python 2.6
(unfortunately, I can't use 2.5 in the project I'm working on).
Has anyone successfully installed it with this configuration? It
seems to build successfully, but when I try to import pycuda.driver,
the python interpreter aborts with an error message:
Fatal Python error: Interpreter not initialized (version mismatch?)
I believe I'm building PyCUDA with the Python 2.6 include files, so
I'm not sure where the version mismatch is coming from. Perhaps from the
Boost Python library? Has anyone seen this error before?
I'm new to both CUDA and PyCUDA. I'm trying to write a binary
erosion/dilation accelerator module for my project, but my functions are
slower than scipy.ndimage's.
I don't know if I'm doing something wrong (as I said, I'm new), or whether
the NVIDIA NVS 140M in my notebook is just not fast enough.
It would be great if someone with a more powerful card could try it, or
maybe some guru :) could have a look at my sources?
Source is attached.
If I get it to work, I'll share it for all :)
Anyway, CUDA and PyCUDA are great work!
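For anyone comparing against a CPU baseline, binary erosion itself is simple enough to write down directly. A naive reference in plain numpy (not the poster's attached code, just a sketch of the operation being accelerated, with zero-padded borders):

```python
import numpy as np

def binary_erosion_3x3(a):
    """Naive 3x3 binary erosion: a pixel survives only if its entire
    3x3 neighbourhood is set (borders are zero-padded)."""
    p = np.pad(a.astype(bool), 1, constant_values=False)
    out = np.ones_like(a, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= p[1 + dy:p.shape[0] - 1 + dy,
                     1 + dx:p.shape[1] - 1 + dx]
    return out

a = np.zeros((5, 5), dtype=bool)
a[1:4, 1:4] = True
print(binary_erosion_3x3(a).sum())  # → 1 (only the centre pixel survives)
```

The nine shifted AND passes also hint at why small images gain little on a GPU: there is very little arithmetic per byte transferred.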
I saw the earlier discussion on the handling of NaN in numpy, and I can see
that currently it is ignored when you use the pure numpy min:
In : a
Out: array([ 0., NaN, 0.])
In : a.min()
In : a.argmin()
However, somehow pycuda's drv.Out() leaves the array in such a state that
a.min() returns NaN while a[a.argmin()] returns something else. I'm not sure
exactly what causes this, as it only happens sometimes. When I have seen
this bug, it's on a large, unwieldy dataset that's hard to debug. The
workaround seems to be to just use a[a.argmin()]...
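For reference, in current numpy the behaviour is well defined: min() and argmin() propagate NaN, while nanmin() and nanargmin() ignore it. A quick sketch (this reflects today's numpy, which may differ from the 2009 behaviour discussed above):

```python
import numpy as np

a = np.array([0.0, np.nan, -1.0])

print(np.isnan(a.min()))   # → True: min() propagates NaN
print(a.argmin())          # → 1: argmin() also lands on the NaN
print(np.nanmin(a))        # → -1.0: nanmin() skips NaN
print(np.nanargmin(a))     # → 2: index of the true minimum
```

So a mismatch between a.min() and a[a.argmin()] on a contiguous array would indeed be surprising, and worth reducing to a small test case.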
Benjamin P. Horstman
Delta Upsilon International Fraternity
President, Gamers Anonymous
CWRU EECS BS/MS 2009