Hi Bogdan,
My latest version of the CUFFT wrapper is included in my python package
PARRET (Parallel RestoreTools).
http://www.mathcs.emory.edu/~yfan/PARRET/doc/index.html
The wrapper is provided by _cufft.py and cufft.py in the package. I also
use a class for fftplan. I hope these two files can help your
development of the wrapper.
For those attending PyCon 2010 next weekend, I'll present a poster on
PARRET. Hope to see some of you there.
Best Wishes,
Ying Wai (Daniel) Fan

Hello Everyone,
I recently saw (on this list) the conversation about comparing GPU speed to
CPU speed. The following code snippet was discussed:
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule
blocks = 64
block_size = 512
nbr_values = blocks * block_size
n_iter = 100000
#############
# GPU SECTION
mod = SourceModule("""
__global__ void addone(float *dest, float *a, int n_iter)
{
const int i = blockDim.x*blockIdx.x + threadIdx.x;
for(int n = 0; n < n_iter; n++) {
a[i] = sin(a[i]);
}
dest[i] = a[i];
}
""")
addone = mod.get_function("addone")
a = numpy.ones(nbr_values).astype(numpy.float32)
a += 1 # a is now an array of 2s
dest = numpy.zeros_like(a)
start = drv.Event()
end = drv.Event()
start.record()
addone(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1),
block=(block_size,1,1))
#stop timer
end.record()
end.synchronize()
secs = start.time_till(end)*1e-3
print "GPU time:", secs
print "GPU result starts with...", dest[:3]
It was then suggested that this could be made even faster with the use of
shared memory. As I am currently trying to better understand shared memory,
I tried to just that:
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule
blocks = 64
block_size = 512
nbr_values = blocks * block_size
n_iter = 100000
#############
# GPU SECTION
mod = SourceModule("""
__global__ void addone(float *dest, float *a, int n_iter)
{
const int i = blockDim.x*blockIdx.x + threadIdx.x;
__shared__ float A[512];
A[threadIdx.x] = a[i];
syncthreads();
for(int n = 0; n < n_iter; n++) {
A[threadIdx.x] = sin(A[threadIdx.x]);
}
syncthreads();
dest[i] = A[threadIdx.x];
}
""")
addone = mod.get_function("addone")
a = numpy.ones(nbr_values).astype(numpy.float32)
a += 1 # a is now an array of 2s
dest = numpy.zeros_like(a)
start = drv.Event()
end = drv.Event()
start.record()
addone(drv.Out(dest), drv.In(a), numpy.int32(n_iter), grid=(blocks,1),
block=(block_size,1,1))
#stop timer
end.record()
end.synchronize()
secs = start.time_till(end)*1e-3
print "GPU time:", secs
print "GPU result starts with...", dest[:3]
It seems simple enough, but for some reason the shared memory version is
slower. Not by much, but it is definitely slower.
Am I doing something wrong, or is the time to transfer to shared memory just
greater than the time saved by using shared memory?
Thank you for your time!
---Chris Heuser

Hello,
I noticed some strange thing recently. Consider the following kernel:
__global__ void test(float *out)
{
float a[2] = {0,0};
a[0] = 1;
a[1] = 2;
out[0] = a[0];
out[1] = a[1];
}
As far as I understand, a[2] should go into registers. According to
PTX file, which is produced by nvcc when I'm compiling simple .cu
file, it seems to be the case:
.entry test (
.param .u32 __cudaparm_test_out)
{
.reg .u32 %r<3>;
.reg .f32 %f<4>;
.loc 15 192 0
$LBB1_test:
.loc 15 198 0
ld.param.u32 %r1, [__cudaparm_test_out];
mov.f32 %f1, 0f3f800000; // 1
st.global.f32 [%r1+0], %f1;
.loc 15 199 0
mov.f32 %f2, 0f40000000; // 2
st.global.f32 [%r1+4], %f2;
.loc 15 200 0
exit;
$LDWend_test:
} // test
But when I'm trying to compile this kernel with PyCuda, for some
reason this function has attribute shared_size_bytes==20. Can anyone
please explain why is the size of shared memory non-zero? I am
completely at a loss here.

This may be of some interest to the list (PyCUDA was covered during
the "Massively Parallel Computing" Class at Harvard).
N
---------- Forwarded message ----------
From: Hanspeter Pfister <pfister(a)seas.harvard.edu>
Date: Fri, Feb 5, 2010 at 8:19 PM
Subject: GPU Computing Summer Internships @ Harvard
To: Hanspeter Pfister <pfister(a)seas.harvard.edu>
Please distribute this call to interested undergraduate students. Thank you!
- Hanspeter
We are pleased to announce summer research opportunities for
undergraduates in scientific GPU computing. The amount of scientic
data created and gathered is growing faster than the growth rate of
computational power, a trend that has been called the information big
bang or the data deluge. We are seeking undergraduates majoring in
science and engineering who have an interest in scientific GPU
computing to participate in a 10-week summer internship.
Students will work with faculty in highly interdisciplinary groups on
projects to develop new algorithms and systems that use GPUs (Graphic
Processing Units) for applications in astronomy, quantum chemistry,
and neuroscience. Participants are part of a large, diverse research
community through organized and informal interactions with students,
mentors, and faculty in the summer intern programs of the Harvard
School of Engineering and Applied Sciences.
Program dates are June 6 - August 14.
Students receive a stipend of $3900 and a $300 travel allowance, as
well as on-campus housing at no additional charge.
Students may apply at www.reusite.seas.harvard.edu/application.
Application deadline is February 28.
For more information on the SciGPU project, please visit http://scigpu.org
--
Nicolas Pinto
Ph.D. Candidate, Brain & Computer Sciences
Massachusetts Institute of Technology, USA
http://web.mit.edu/pinto

Hi all,
I have a problem and almost a solution, but my program crashes after
running and giving the right output. The problem is to use an external
library, in this case chag:pp [1] which is a header file only library,
along with PyCUDA. What I'm trying to do is to use their compaction
algorithm on a pycuda array. I've successfully built a DLL from a C++
file which basically does this:
struct Predicate
{
__device__ bool operator() (double value) const
{
return value>0.0;
}
};
void find_positive(
int x_gpu_start,
int x_gpu_end,
int y_gpu_start,
int count_start
)
{
pp::compact(
(double *)x_gpu_start, /* Input start pointer */
(double *)x_gpu_end, /* Input end pointer */
(double *)y_gpu_start, /* Output start pointer */
(size_t *)count_start, /* Storage for valid element count */
Predicate() /* Predicate */
);
}
I'm then using ctypes to load this library and call that function, like
this:
lib = cdll.LoadLibrary('testchagpp')
find_positive = lib.find_positive
x = randn(100)
y = to_gpu(x)
out = to_gpu(zeros(len(x)))
count = to_gpu(array([0], dtype=int))
find_positive(int(y.gpudata), int(y.gpudata)+len(x)*8,
int(out.gpudata), int(count.gpudata))
The program works, in that it correctly compacts the array x into the
array y, but afterwards I get the error:
This application has requested the Runtime to terminate it in an unusual
way.
Please contact the application's support team for more information.
I'm on 64 bit Windows 7, but using a 32 bit build of Python, PyCuda, etc.
I see two possibilities. Possibility 1: I've just made a mistake
somewhere about something obvious, and this approach should work.
Possibility 2: I can't pass pycuda data to the DLL file and this causes
something to break somewhere.
Any ideas?
Many thanks,
Dan Goodman
[1] http://www.cse.chalmers.se/~billeter/pub/pp/

Hi all,
I'm attempting to use PyCUDA to write a GPU backend for a Gaussian process
package, http://pymc.googlecode.com/files/GPUserGuide.pdf. I'm pretty
excited about it; if all goes well it should be possible to hide the GPU
completely from the user. PyCUDA has made development much easier so far.
Now I've got to wrap some functions in the MAGMA library,
http://icl.cs.utk.edu/magma/. I'm getting segfaults in function
magma_spotrf_gpu, which does Cholesky factorizations. This is my first time
using both ctypes and MAGMA, and I'm pretty new to GPU programming in
general, so I'm having a hard time diagnosing the problem.
However, I've wrapped the 'CPU-only' version of magma_spotrf_gpu without
segfaults, which leads me to believe that the problem is the argument A,
which is the on-gpu matrix to be factorized. magma_spotrf_gpu expects A to
be a float*, and I'm giving it
ctypes.POINTER(ctypes.c_float).from_address(int(A))
where A is the output of pycuda's to_device.
The documentation of magma_spotrf_gpu follows, and a short program
demonstrating the problem follows that. Unfortunately it's not possible to
run the program; you would have to build magma v0.2 as a shared library, and
so far the package is only available in binary form, and the 64-bit version
was not compiled with -fPIC. If anyone is keen to run it, please email me
off-list and I'll tell you how to go about it.
Any help would be much appreciated.
Thanks in advance,
Anand
---------------- magma_sportf_gpu documentation
int magma_spotrf_gpu(char *uplo, int *n, float *a, int *lda,
float *work, int *info)
SPOTRF computes the Cholesky factorization of a real symmetric positive
definite matrix A.
The factorization has the form
A =U**T*U, if UPLO=’U’,
or
A=L *L**T, if UPLO=’L’,
where U is an upper triangular matrix and L is lower triangular.
This is the block version of the algorithm, calling Level 3 BLAS.
UPLO (input) CHARACTER*1
= ’U’: Upper triangle of A is stored;
= ’L’: Lower triangle of A is stored.
N (input) INTEGER The order of the matrix A. N >= 0.
A (input/output) REAL array on the GPU, dimension (LDA,N)
On entry, the symmetric matrix A. If UPLO = ’U’, the
leading
N-by-N upper triangular part of A contains the upper
triangular part of the matrix A, and the strictly lower
triangular part of A is not referenced. If UPLO = ’L’, the
leading N-by-N lower triangular part of A contains the
lower
triangular part of the matrix A, and the strictly upper
triangular part of A is not referenced.
On exit, if INFO = 0, the factor U or L from the Cholesky
factorization A = U**T*U or A = L*L**T.
LDA (input) INTEGER The leading dimension of the array A. LDA
>= max(1,N).
WORK (workspace) REAL array, dimension at least (nb, nb) where
nb can be
obtained through magma_get_spotrf_nb(*n)
Work array allocated with cudaMallocHost.
INFO (output) INTEGER
= 0: successful exit
< 0: if INFO = -i, the i-th argument had an illegal value
> 0: if INFO = i, the leading minor of order i is not
positive definite, and the factorization could not be completed.
---------------- pymagma.py
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import ctypes
magma_path =
'/home/malaria-atlas-project/sources/magma_0.2s/lib/objectfiles/libmagma.so'
libmagma = ctypes.cdll.LoadLibrary(magma_path)
int_pointer = lambda x: ctypes.pointer(ctypes.c_int(x))
char_pointer = lambda x: ctypes.pointer(ctypes.c_char(x))
_magma_spotrf_gpu = libmagma.magma_spotrf_gpu
_magma_spotrf_gpu.restype = ctypes.c_uint
_magma_spotrf_gpu.argtypes = [ctypes.c_char_p,
ctypes.POINTER(ctypes.c_int),
ctypes.POINTER(ctypes.c_float),
ctypes.POINTER(ctypes.c_int),
np.ctypeslib.ndpointer(dtype='float32'),
ctypes.POINTER(ctypes.c_int)]
magma_get_spotrf_nb = libmagma.magma_get_spotrf_nb
magma_get_spotrf_nb.restype = ctypes.c_uint
magma_get_spotrf_nb.argtypes = [ctypes.POINTER(ctypes.c_int)]
def magma_spotrf_gpu(uplo, n, A, lda, work):
info = 1
from IPython.Debugger import Pdb
Pdb(color_scheme='LightBG').set_trace()
_magma_spotrf_gpu(char_pointer(uplo),
int_pointer(n),
#
====================================================================
# = I think this argument is responsible for the
segmentation fault. =
#
====================================================================
ctypes.POINTER(ctypes.c_float).from_address(int(A)),
int_pointer(lda),
work.ctypes.data_as(ctypes.POINTER(ctypes.c_float)),
int_pointer(info))
return info
if __name__ == '__main__':
n = 10
# Create matrix to be factored
A_orig = (np.eye(n) + np.ones((n,n))*.3).astype('float32')
A_gpu = cuda.to_device(A_orig)
# Allocate pagelocked work array
nwork = magma_get_spotrf_nb(int_pointer(n))
print nwork
work_gpu = cuda.pagelocked_empty((nwork,nwork), dtype='float32')
# # Do Cholesky factorization
info = magma_spotrf_gpu('L', n, A_gpu, n, work_gpu)