Cannot import both pycuda and pyopencl in the same program
by Bogdan Opanchuk
Hi all,
I'm observing the following behavior with the latest (git-fetched today)
PyCUDA and PyOpenCL versions on Snow Leopard 10.6.4:
$ python
>>> import pycuda.driver
>>> import pyopencl
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/pyopencl-0.92beta-py2.6-macosx-10.6-i386.egg/pyopencl/__init__.py",
line 3, in <module>
import pyopencl._cl as _cl
AttributeError: 'NoneType' object has no attribute '__dict__'
$ python
>>> import pyopencl
>>> import pycuda.driver
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/pycuda-0.94rc-py2.6-macosx-10.6-i386.egg/pycuda/driver.py",
line 1, in <module>
from _driver import *
AttributeError: 'NoneType' object has no attribute '__dict__'
This worked with stable versions. Does anyone know why this is happening?
(One may ask why I need both libraries in the same program: I have
a set of tests for my module, which can use both CUDA and OpenCL,
and it is convenient to run all of them from a single file.
It is not a critical issue; I'm just curious.)
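A possible stopgap for that single-file use case, sketched here with placeholder script names, is to run each backend's tests in its own interpreter so the two extension modules never share a process:
import subprocess
import sys

# run_cuda_tests.py and run_opencl_tests.py are placeholder names for
# per-backend test scripts; the point is only that pycuda and pyopencl
# never get imported into the same interpreter.
for test_script in ["run_cuda_tests.py", "run_opencl_tests.py"]:
    subprocess.check_call([sys.executable, test_script])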
Best regards,
Bogdan
pycuda install help
by nhieu duong
Hi,
I've been trying endlessly to install PyCUDA on a Red Hat machine, but
to no avail. It would be much appreciated if I could get some help.
I am able to get past the configure part of the installation, but when I
run "make", the problem occurs. Here is my siteconf.py file:
BOOST_INC_DIR = ['/usr/local/include/boost/']
BOOST_LIB_DIR = ['/usr/lib']
BOOST_COMPILER = 'gcc4.1.2'
BOOST_PYTHON_LIBNAME = ['boost_python']
BOOST_THREAD_LIBNAME = ['boost_thread']
CUDA_TRACE = False
CUDA_ROOT = '/usr/local/cuda/'
CUDA_ENABLE_GL = False
CUDADRV_LIB_DIR = ['/usr/lib']
CUDADRV_LIBNAME = ['cuda']
CXXFLAGS = ['-DBOOST_PYTHON_NO_PY_SIGNATURES']
LDFLAGS = []
I believe I built Boost with gcc version 4.1.2.
The error I'm getting is:
/usr/local/include/boost/type_traits/remove_const.hpp:61: instantiated
from ‘boost::remove_const<<unnamed>::pooled_host_allocation>’
/usr/local/include/boost/python/object/pointer_holder.hpp:127:
instantiated from ‘void* boost::python::objects::pointer_holder<Pointer,
Value>::holds(boost::python::type_info, bool) [with Pointer =
std::auto_ptr<<unnamed>::pooled_host_allocation>, Value =
<unnamed>::pooled_host_allocation]’
src/wrapper/mempool.cpp:278: instantiated from here
/usr/local/include/boost/type_traits/detail/cv_traits_impl.hpp:38: internal
compiler error: in make_rtl_for_nonlocal_decl, at cp/decl.c:5067
I only included the end of the output; if you want the entire thing, let
me know. But the error seems to point to a gcc problem. I've read through
the archives, but that didn't solve this problem.
If someone could shed some light on this issue, I would very much appreciate it.
Thanks
-nhieu
SparseSolve.py example
by elafrit
Hello - I'm trying to run the SparseSolve.py example. I installed the
PyMetis package after adjusting the configuration like this:
./configure --python-exe=python2.6 --boost-inc-dir=/usr/include/boost
--boost-lib-dir=/usr/lib/ --boost-python-libname=boost_python-mt-py26
But when running the SparseSolve.py example I encountered this error:
ImportError:
/usr/local/lib/python2.6/dist-packages/PyMetis-0.91-py2.6-linux-x86_64.egg/pymetis/_internal.so:
undefined symbol: regerrorA
What does this error mean? Thanks for any suggestions.
compiler.get_nvcc_version() problems with mpi4py/mpich2 1.3
by avidday
I have hit a wall moving some existing pycuda code to a distributed
memory cluster and am hoping someone cleverer than I can suggest a
workaround.
In a nutshell, I have found that get_nvcc_version doesn't work in the
way the pycuda.compiler module expects with the MPI flavours I use
under certain circumstances, which makes behind-the-scenes JIT
compilation inside PyCUDA fail if all the MPI processes don't share a
common /tmp filesystem. The simplest repro case is this snippet:
----------
import sys
from pycuda import compiler
from mpi4py import MPI
rank = MPI.COMM_WORLD.Get_rank()
sys.stdout.write("[%d] %s\n" %(rank, compiler.get_nvcc_version("nvcc")))
----------
which will do this when run on a single node
$ mpiexec -n 4 python ./pycudachk.py
[2] None
[3] None
[1] None
[0] nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Wed_Nov__3_16:16:57_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221
I presume this is because the pytools stdout/stderr capture routines
don't get anything on MPI processes where stdout/stderr are being
managed by the MPI runtime. If all the MPI processes share a common
/tmp filesystem, it doesn't seem to matter because the pycuda compiler
cache is visible to each process and it all just works. But if they
don't (and the cluster I am now using has local /tmp on every node),
the compiler module will need to JIT compile stuff locally on each
node, and it winds up failing because get_nvcc_version returns None,
which in turn makes the md5 hashing calls fail. Something like this:
[avidday@n0005 fim]$ mpiexec -n 2 -hosts n0008,n0005 ./fimMPItest.py
[1] mpi dims = (2,1), coords = (1,0)
[0] mpi dims = (2,1), coords = (0,0)
[1]{n0005} CUDA driver GeForce GTX 275, 1791Mb ram, using fim_sm13.cubin
[0]{n0008} CUDA driver GeForce GTX 275, 1791Mb ram, using fim_sm13.cubin
Traceback (most recent call last):
File "./fimMPItest.py", line 20, in ?
phi,its = dotest()
File "./fimMPItest.py", line 16, in dotest
return fim.fimMPIScatter2(gs,a,1.,h,maxiters=1000,tol=1e-6,CC=fimcuda.fimCC)
File "/scratch/fim/fim.py", line 464, in fimMPIScatter2
its = mpiCC.Iterate(f, h, tol=tol, maxiters=maxiters)
File "/scratch/fim/fimcuda.py", line 187, in Iterate
self.active_.fill(np.int32(0))
File "/usr/lib64/python2.4/site-packages/pycuda/gpuarray.py", line
336, in fill
File "<string>", line 1, in <lambda>
File "/usr/lib64/python2.4/site-packages/pycuda/tools.py", line 485,
in context_dependent_memoize
File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 384, in get_fill_kernel
File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 98, in get_elwise_kernel
File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 85, in get_elwise_kernel_and_types
File "/usr/lib64/python2.4/site-packages/pycuda/elementwise.py",
line 74, in get_elwise_module
File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line
238, in __init__
File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line
228, in compile
File "/usr/lib64/python2.4/site-packages/pycuda/compiler.py", line
47, in compile_plain
TypeError: update() argument 1 must be string or read-only buffer, not None
which I interpret as meaning that an internal compile to satisfy a
gpuarray fill() call is failing. Running on only a single node works
fine. If I am reading things correctly, it looks like this checksum
code in compile_plain fails because of what get_nvcc_version returns:
41     if cache_dir:
42         checksum = _new_md5()
43
44         checksum.update(source)
45         for option in options:
46             checksum.update(option)
47         checksum.update(get_nvcc_version(nvcc))
The question then becomes how to fix it. It is important to note that
nvcc is available to all processes and works, so I assume that the fork
itself is fine (my reading of the code says that an OSError exception
would be raised otherwise). So I am guessing the problem is only that
get_nvcc_version can return None even when the fork worked. Would it
be too hackish to have get_nvcc_version return something generic like
"nvcc unknown version" - something that would still hash OK in the
case where the fork worked but the captured output from the fork is
not available?
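Something like this monkey-patch is roughly what I have in mind (just a sketch that papers over the missing captured output; it assumes the wrapper is installed before anything triggers a JIT compile):
from pycuda import compiler

_original_get_nvcc_version = compiler.get_nvcc_version

def get_nvcc_version_or_placeholder(nvcc_binary):
    # fall back to a fixed string when the fork worked but the MPI
    # runtime swallowed the captured nvcc output
    result = _original_get_nvcc_version(nvcc_binary)
    if result is None:
        result = "nvcc unknown version"
    return result

compiler.get_nvcc_version = get_nvcc_version_or_placeholder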
Suggestions or corrections to my analysis are greatly welcomed (I know
almost nothing about python, pyCUDA or MPI so be gentle if I am
completely off target here.....)
Problems compiling
by Bryan Absher
Hello all,
I have heard great things about PyCUDA, if only it would run. Here
is the output from test_driver.py:
$ python ./test_driver.py
Traceback (most recent call last):
File "./test_driver.py", line 4, in <module>
from pycuda.tools import mark_cuda_test
File "/usr/local/lib/python2.6/dist-packages/pycuda-0.94.1-py2.6-linux-x86_64.egg/pycuda/tools.py",
line 30, in <module>
import pycuda.driver as cuda
File "/usr/local/lib/python2.6/dist-packages/pycuda-0.94.1-py2.6-linux-x86_64.egg/pycuda/driver.py",
line 1, in <module>
from pycuda._driver import *
ImportError: /usr/local/lib/python2.6/dist-packages/pycuda-0.94.1-py2.6-linux-x86_64.egg/pycuda/_driver.so:
undefined symbol: cuMemAllocPitch_v2
$ ldd ./_driver.so
linux-vdso.so.1 => (0x00007fff80f56000)
libcuda.so.1 => /usr/lib/nvidia-current/libcuda.so.1
(0x00007fc378457000)
libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007fc378143000)
libm.so.6 => /lib/libm.so.6 (0x00007fc377ebf000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fc377ca8000)
libpthread.so.0 => /lib/libpthread.so.0 (0x00007fc377a8b000)
libc.so.6 => /lib/libc.so.6 (0x00007fc377707000)
libz.so.1 => /lib/libz.so.1 (0x00007fc3774f0000)
libdl.so.2 => /lib/libdl.so.2 (0x00007fc3772ec000)
/lib64/ld-linux-x86-64.so.2 (0x00007fc379030000)
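In case it helps with diagnosis, here is a small check that queries the driver's CUDA API version through ctypes without importing pycuda at all (if I understand correctly, the *_v2 entry points such as cuMemAllocPitch_v2 only appeared with the CUDA 3.2 driver):
import ctypes

# Ask the installed driver which CUDA API version it supports, to see
# whether it is older than the toolkit pycuda was built against.
libcuda = ctypes.CDLL("libcuda.so.1")
version = ctypes.c_int()
libcuda.cuDriverGetVersion(ctypes.byref(version))
print "driver CUDA API version:", version.value  # e.g. 3020 means CUDA 3.2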
There were a couple of warnings when compiling about possibly
uninitialized values, similar to this:
gcc -pthread -fno-strict-aliasing -fwrapv -Wall -O3 -DNDEBUG -fPIC
-DBOOST_MULTI_INDEX_DISABLE_SERIALIZATION=1 -Isrc/cpp
-Ibpl-subset/bpl_subset -I/usr/local/cuda/include
-I/usr/lib/python2.6/dist-packages/numpy/core/include
-I/usr/include/python2.6 -c
bpl-subset/bpl_subset/libs/python/src/converter/from_python.cpp -o
build/temp.linux-x86_64-2.6/bpl-subset/bpl_subset/libs/python/src/converter/from_python.o
bpl-subset/bpl_subset/libs/python/src/converter/from_python.cpp: In
function ‘boost::python::converter::rvalue_from_python_stage1_data
boost::python::converter::rvalue_from_python_stage1(PyObject*, const
boost::python::converter::registration&)’:
bpl-subset/bpl_subset/libs/python/src/converter/from_python.cpp:42:
warning: ‘data.boost::python::converter::rvalue_from_python_stage1_data::construct’
may be used uninitialized in this function
I did make sure that the Boost libraries are installed, but it kept
giving me an error about failing to find the Boost libraries. I
think there may be a bug in the way setup.py searches for
the Boost libraries.
Any help would be much appreciated! I am looking forward to playing
around with pycuda.
question re complex pow
by Lev Givon
I recently attempted to run the following code with CUDA 3.2 and
Pycuda 0.94.2 on a Quadro NVS 290 installed on a Linux x86_64 system:
import pycuda.gpuarray as gpuarray
import pycuda.driver as drv
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
func_mod = SourceModule("""
#include <pycuda/pycuda-complex.hpp>
#define TYPE pycuda::complex<float>
__global__ void func(TYPE *a, TYPE *b, int N)
{
    int idx = threadIdx.x;
    if (idx < N)
        b[idx] = pow(a[idx], 2);
}
""")
func = func_mod.get_function("func")
N = 10
a = np.complex64(np.random.rand(N)+np.random.rand(N)*1j)
b = np.complex64(np.zeros(N))
func(drv.In(a), drv.Out(b), np.uint32(N), block=(512,1,1))
print 'in: ', a
print 'out (cuda): ', b
print 'out (np): ', a**2
When I did so, I observed the following error:
pytools.prefork.ExecError: error invoking 'nvcc --cubin -arch sm_11
-I/usr/lib64/python2.6/site-packages/pycuda/../../../../include/pycuda
kernel.cu': status 2 invoking 'nvcc --cubin -arch sm_11
-I/usr/lib64/python2.6/site-packages/pycuda/../../../../include/pycuda
kernel.cu': ./kernel.cu(10): Error: External calls are not supported
(found non-inlined call to _ZN6pycuda3powERKNS_7complexIfEEi)
Casting the exponent to a float or pycuda::complex<float> prevents the
error from occurring, but casting it to int does not. Is this expected?
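For reference, the variants that compile cleanly for me look like this (same kernel as above, only the pow line changes):
import pycuda.autoinit
from pycuda.compiler import SourceModule

# Same kernel as above, with the exponent given an explicit float
# (or complex) type so the pow() call no longer resolves to the
# non-inlined complex<float>/int overload.
func_mod_fixed = SourceModule("""
#include <pycuda/pycuda-complex.hpp>
#define TYPE pycuda::complex<float>
__global__ void func(TYPE *a, TYPE *b, int N)
{
    int idx = threadIdx.x;
    if (idx < N)
        b[idx] = pow(a[idx], 2.0f);       // float exponent compiles
        // b[idx] = pow(a[idx], TYPE(2)); // so does a complex exponent
}
""")
func = func_mod_fixed.get_function("func")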
L.G.
Building PyOpenCL for CPU only
by Riaan van den Dool
Andreas
Would it be possible at all to create a version of PyOpenCL that does not
link to and does not need any proprietary drivers? Maybe via a make option.
I think it would help adoption of PyOpenCL if code written using
pyopencl would 'always run' even if no GPU is available.
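For illustration, here is a small sketch of the kind of code I mean: it prefers a CPU device when one is exposed through the ICD loader (assuming a CPU OpenCL implementation, e.g. the AMD APP SDK's CPU driver, is installed):
import pyopencl as cl

# Prefer a CPU device if any platform exposes one; otherwise fall back
# to whatever device the ICD loader offers.
cpu_devices = []
for platform in cl.get_platforms():
    try:
        cpu_devices.extend(platform.get_devices(device_type=cl.device_type.CPU))
    except cl.RuntimeError:
        pass  # this platform has no CPU devices

if cpu_devices:
    ctx = cl.Context(devices=[cpu_devices[0]])
else:
    ctx = cl.create_some_context()

queue = cl.CommandQueue(ctx)
print "using device:", ctx.devices[0].name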
If there is already such a version, please excuse my ignorance and point me
in the right direction.
Riaan
CURAND wrappers - Long pause after generation of a random matrix
by Martin Laprise
Hi, I just made some experiments with the CURAND wrappers. They seem to work
very nicely except for a little detail that I can't figure out. The
initialization of the generator and the actual random number generation seem
very fast. But for whatever reason, PyCUDA takes a long time to "recover"
after the number generation. This pause is significantly longer than the
actual computation, and the delay increases with N. Here is an example:
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray
from pycuda.curandom import (PseudoRandomNumberGenerator,
        QuasiRandomNumberGenerator)
import cProfile
import time as clock

def curand_prof():
    N = 100000000
    t1 = clock.time()
    # GPU
    rr = PseudoRandomNumberGenerator(0,
            np.random.random(128).astype(np.int32))
    data = pycuda.gpuarray.zeros([N], np.float32)
    rr.fill_normal_float(data.gpudata, N)
    t2 = clock.time()
    print "Bench 1: " + str(t2-t1) + " sec"

if __name__ == "__main__":
    t1 = clock.time()
    curand_prof()
    t2 = clock.time()
    print "Bench 2: " + str(t2-t1) + " sec"
Here is the actual output with a GTX 260 gpu:
Bench 1: 0.0117599964142 sec
Bench 2: 4.40562295914 sec
In the example, the pause has no consequence, but if I want to use the
random matrix in another kernel ... it's quite a delay. I've done some
research, and my guess is that the problem is linked to this already
reported problem here:
http://forums.nvidia.com/index.php?showtopic=185740
Does anyone know how we can implement the solution in the wrapper?
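For comparison, here is a minimal timing sketch that forces the context to synchronize before reading the timer; kernel launches in PyCUDA are asynchronous, so Bench 1 above may not include the generation kernel's run time at all (this reuses the same wrapper calls as the listing above):
import time as clock

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray
from pycuda.curandom import PseudoRandomNumberGenerator

N = 100000000
rr = PseudoRandomNumberGenerator(0, np.random.random(128).astype(np.int32))
data = pycuda.gpuarray.zeros([N], np.float32)

t1 = clock.time()
rr.fill_normal_float(data.gpudata, N)
drv.Context.synchronize()  # wait for the generation kernel to finish
t2 = clock.time()
print "fill_normal_float including sync: " + str(t2 - t1) + " sec"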
Martin
CURAND wrappers
by Andreas Kloeckner
On Mon, 20 Dec 2010 21:41:16 +0100, Tomasz Rybak <bogomips(a)post.pl> wrote:
> At the same time - could you look into CURAND patch I have sent
> to the list (attached here)? Last email I have sent on 2010-12-15 22:06
> I would like to finish it and then finish prefix scan.
I've taken a look at your CURAND code, here are a few comments:
- The user should not *have* to specify generator_count. Instead, we
should supply a reasonable default based on the device's compute
capability, as you describe in the docs.
(Likewise, the docs don't need to be redundant.)
- I don't like the name "Randomizer". "RandomNumberGenerator" is long,
but IMO a better name.
- What's the difference between the quasi- and non-quasi versions? It
looks like there's a ton of duplicated code between the two. This
should be eliminated, perhaps by inheritance or in some other way (a
rough sketch follows after this list).
- Tests should go in tests/test_gpuarray.py.
- Rename fill_in_* to fill_*.
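To make the duplication point concrete, here is a rough sketch of the structure I have in mind (all names and method bodies are illustrative only, not the actual curandom API):
class _RandomNumberGeneratorBase(object):
    """Shared setup and fill logic for both generator flavours."""

    def __init__(self, generator_count):
        self.generator_count = generator_count
        self._init_states()

    def _init_states(self):
        # subclasses set up their CURAND state (XORWOW vs. Sobol) here
        raise NotImplementedError

    def fill_uniform(self, data, count):
        # kernel preparation and launch code shared by both variants
        pass

class PseudoRandomNumberGenerator(_RandomNumberGeneratorBase):
    def _init_states(self):
        # pseudo-random (XORWOW) state initialization only
        pass

class QuasiRandomNumberGenerator(_RandomNumberGeneratorBase):
    def _init_states(self):
        # quasi-random (Sobol) state initialization only
        pass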
Thanks for your contribution! Looking forward to your comments.
Andreas
pyfft on large 3d arrays
by Saigopal Nelaturi
Hello all,
I am implementing a simple 3d convolution on the GPU using pyfft. The
basic idea is straightforward - obtain the 3d Fourier transform of each
array, multiply, and take the inverse transform of the product. The code
below works correctly when my input array is 256^3 but fails (executes,
but gives garbage results) for a 512^3 voxel grid.
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pyfft.cuda import Plan

# w,h,k are the array dimensions, each a power of 2
# im1, im2 are the input 3d arrays of dtype complex64
plan = Plan((w,h,k), normalize=True)

# forward transform on device
im1_gpu = gpuarray.to_gpu(im1)
plan.execute(im1_gpu)
im1_ft = im1_gpu.get()
del im1_gpu

im2_gpu = gpuarray.to_gpu(im2)
plan.execute(im2_gpu)
im2_ft = im2_gpu.get()
del im2_gpu

# do multiplication on host - could also be done on device
conv = im1_ft * im2_ft

# inverse transform on device
conv_gpu = gpuarray.to_gpu(conv)
plan.execute(conv_gpu, inverse=True)
corr = conv_gpu.get()
I don't think there's anything wrong with the code as such (it works for
smaller array sizes), but I am perplexed as to why the failure occurs. I
am running the code on a Tesla C2050 (2.8GB available memory) and so
there's enough space to hold the 512^3 array with complex64 dtype. Does
anyone have an explanation?
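For what it's worth, a quick back-of-the-envelope footprint check (complex64 is 8 bytes per element):
import numpy as np

n = 512
itemsize = np.dtype(np.complex64).itemsize  # 8 bytes per element
size_gib = n**3 * itemsize / float(1024**3)
print "one %d^3 complex64 array: %.2f GiB" % (n, size_gib)
# exactly 1 GiB per array, so even with an extra scratch buffer of the
# same size the transform should fit in the C2050's ~2.8 GB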
-Saigopal