Hello.
I attach a patch updating pycuda.tools.DeviceData and pycuda.tools.OccupancyRecord
to take new devices into consideration. I have tried to maintain the "style" of
those classes and introduced changes only where necessary. I made the changes
using my old notes and the NVIDIA Occupancy Calculator. Unfortunately, I do not
currently have access to a Fermi card to test them fully.
Best regards.
Tomasz Rybak
Hi all,
I'm observing the following behavior with the latest (git-fetched today)
pycuda and pyopencl versions on Snow Leopard 10.6.4:
$ python
>>> import pycuda.driver
>>> import pyopencl
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/pyopencl-0.92beta-py2.6-macosx-10.6-i386.egg/pyopencl/__init__.py", line 3, in <module>
import pyopencl._cl as _cl
AttributeError: 'NoneType' object has no attribute '__dict__'
$ python
>>> import pyopencl
>>> import pycuda.driver
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.6/site-packages/pycuda-0.94rc-py2.6-macosx-10.6-i386.egg/pycuda/driver.py", line 1, in <module>
from _driver import *
AttributeError: 'NoneType' object has no attribute '__dict__'
This worked with the stable versions. Does anyone know why this is happening?
(One may ask why I need both libraries in the same program. I have a set of
tests for my module, which can use both CUDA and OpenCL, and it is convenient
to run all the tests from a single file. It is not a critical issue; I'm just
curious.)
Best regards,
Bogdan
Hi,
I want to run element-wise computations on different parts of an array.
Loading each part of the array into device memory when needed turned out to
use up a lot of time and not really speed things up compared to the CPU.
Instead, I want to load the data array into device memory once and provide
pointers indicating which elements to look at (I do have the numpy view/slice
of the array). I have looked into different ways of doing this but can't seem
to find the right approach; any help would be appreciated.
ElementwiseKernel seems to support ranges and slicing now; however, my code is
(CUDA) C and I import it as a SourceModule, which probably means I can't use
the ElementwiseKernel approach.
-Thomas
Hi,
I've been endlessly trying to install PyCUDA on a Red Hat machine, but to no
avail. It would be much appreciated if I could get some help. I am able to get
past the configure part of the installation, but when I "make", the problem
occurs. Here is my siteconf.py file:
BOOST_INC_DIR = ['/usr/local/include/boost/']
BOOST_LIB_DIR = ['/usr/lib']
BOOST_COMPILER = 'gcc4.1.2'
BOOST_PYTHON_LIBNAME = ['boost_python']
BOOST_THREAD_LIBNAME = ['boost_thread']
CUDA_TRACE = False
CUDA_ROOT = '/usr/local/cuda/'
CUDA_ENABLE_GL = False
CUDADRV_LIB_DIR = ['/usr/lib']
CUDADRV_LIBNAME = ['cuda']
CXXFLAGS = ['-DBOOST_PYTHON_NO_PY_SIGNATURES']
LDFLAGS = []
I believe I built Boost with gcc version 4.1.2. The error I'm getting is:
/usr/local/include/boost/type_traits/remove_const.hpp:61: instantiated
from ‘boost::remove_const<<unnamed>::pooled_host_allocation>’
/usr/local/include/boost/python/object/pointer_holder.hpp:127:
instantiated from ‘void* boost::python::objects::pointer_holder<Pointer,
Value>::holds(boost::python::type_info, bool) [with Pointer =
std::auto_ptr<<unnamed>::pooled_host_allocation>, Value =
<unnamed>::pooled_host_allocation]’
src/wrapper/mempool.cpp:278: instantiated from here
/usr/local/include/boost/type_traits/detail/cv_traits_impl.hpp:38: internal
compiler error: in make_rtl_for_nonlocal_decl, at cp/decl.c:5067
I only included the end of the output; if you want the entire thing, let me
know. The error seems to point to a gcc problem (an internal compiler error).
I've read through the archives, but nothing there seems to solve this problem.
If someone could shed some light on this issue, I would greatly appreciate it.
Thanks
-nhieu
Hi, everyone.
I installed PyCUDA following "Installing PyCuda on Windows - Windows 7 64-bit with Visual Studio Professional 2008 (Strictly Binary Versions)". The process seemed to succeed: the basic demo (hello_demo), which does not use pycuda.gpuarray, passes. But when I run a program that uses pycuda.gpuarray, I receive this exception message:
"
Traceback (most recent call last):
File "C:\Users\summit\workspace\JCudaIP\src\pycudatest.py", line 11, in <module>
a_doubled = (2*a_gpu).get()
File "C:\Python27\lib\site-packages\pycuda\gpuarray.py", line 285, in __rmul__
return self._axpbz(scalar, 0, result)
File "C:\Python27\lib\site-packages\pycuda\gpuarray.py", line 164, in _axpbz
other, out.gpudata, self.mem_size)
File "C:\Python27\lib\site-packages\pycuda\driver.py", line 273, in function_prepared_async_call
func.launch_grid(*grid)
pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources
"
My card is a "GeForce GT 425M"; here are more details on the properties of the CUDA device:
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1041825792 bytes
Multiprocessors x Cores/MP = Cores: 2 (MP) x 48 (Cores/MP) = 96 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.12 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads
can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce GT 425M
My OS is Win7 x64, with VC 9.0.
This problem has troubled me for a long time; I'm waiting for your help! Thanks
I'm trying to install PyCUDA 0.94.2 under Windows 7, using MSVC 2008 Express,
Python 2.6.2, Boost 1.46.1 and NVIDIA driver 196.21 (CUDA 3.0.1). Everything
compiled successfully; however, importing pycuda does not work.
Traceback (most recent call last):
File "test_driver.py", line 4, in <module>
from pycuda.tools import mark_cuda_test
File "C:\Python26\lib\site-packages\pycuda\tools.py", line 30, in <module>
import pycuda.driver as cuda
File "C:\Python26\lib\site-packages\pycuda\driver.py", line 1, in <module>
from pycuda._driver import *
ImportError: DLL load failed: The specified procedure could not be found.
Note that this isn't the "The specified module..." error.
I tried different Boost versions, and binary installers for both Boost and
PyCUDA, and I am still getting the same error. What could be the problem?
Tesla C2070
From: Bryan Catanzaro [mailto:bryan.catanzaro@gmail.com]
Sent: 25 March 2011 19:07
To: Bergtholdt, Martin
Cc: pycuda(a)tiker.net
Subject: Re: [PyCUDA] printf
What device are you running this on?
- bryan
On Mar 25, 2011, at 11:05 AM, "Bergtholdt, Martin" <martin.bergtholdt(a)philips.com<mailto:martin.bergtholdt@philips.com>> wrote:
Hi,
I'm trying to run the printf example in the wiki:
>>> import pycuda.driver as cuda
>>> import pycuda.autoinit
>>> from pycuda.compiler import SourceModule
>>>
>>> mod = SourceModule("""
... #include <stdio.h>
...
... __global__ void say_hi()
... {
... printf("I am %d.%d\\n", threadIdx.x, threadIdx.y);
... }
... """)
>>>
>>> func = mod.get_function("say_hi")
>>> func(block=(4,4,1))
>>>
Everything works fine, but I do not get any output on the last line. Any ideas what can go wrong here?
Cuda 3.2
Windows-7
pycuda.VERSION (2011, 1)
MSVC 9.0 2008 (32-bit)
________________________________
The information contained in this message may be confidential and legally protected under applicable law. The message is intended solely for the addressee(s). If you are not the intended recipient, you are hereby notified that any use, forwarding, dissemination, or reproduction of this message is strictly prohibited and may be unlawful. If you are not the intended recipient, please contact the sender by return e-mail and destroy all copies of the original message.
_______________________________________________
PyCUDA mailing list
PyCUDA(a)tiker.net<mailto:PyCUDA@tiker.net>
http://lists.tiker.net/listinfo/pycuda
Hi,
I'm new to CUDA and PyCUDA, and am having a problem with indexing when using
multiple blocks in a grid. I'm using an older CUDA-enabled card (Quadro FX
1700) before I begin writing for a larger GPU. I've been trying to understand
the relationship between threads, blocks, and grids in the context of my
individual card. To do so, I've set up a simple script.
The following code will work just fine, printing out an array of values 0-99
----------------------------------------------------------------------------------------------
import numpy
import pycuda.gpuarray as gpuarray
import pycuda.driver as drv
import pycuda.autoinit
def testgpu2():
    from pycuda.compiler import SourceModule
    mod = SourceModule("""
    __global__ void kernel1(float *z1)
    {
        const int i = (blockIdx.x * blockDim.x) + threadIdx.x;
        const int j = (blockIdx.y * blockDim.y) + threadIdx.y;
        z1[i*j]=i*j;
    }
    """)
    kernel1 = mod.get_function("kernel1")
    z1 = numpy.zeros((100)).astype(numpy.float32)
    kernel1(drv.Out(z1),block=(10,10,1),grid=(1,1))
    print z1
    return z1
----------------------------------------------------------------------------------------------
However, what if I have an array that's 1024 in length? If I understand the
documentation correctly, block=(16,16,1) is the maximum (256 threads) allowed
for my hardware, which means I have to increase the number of blocks in the
grid. If I change the parameters of my script to:
z1 = numpy.zeros((1024)).astype(numpy.float32)
kernel1(drv.Out(z1),block=(16,16,1),grid=(2,2))
how do I correctly index the array locations in my kernel function given
multiple blocks (z1[???]=???)? There is a gridDim property, but no gridIdx
property, as there is for threads and blocks.
Thanks!
Mike
Hello.
I attach a patch containing code that is supposed to initialise Sobol32
direction vectors. Initially I wanted to implement it myself based on the
article, but after looking at CUDA 4.0 (and the new functions that would also
need to be reimplemented) I decided just to call the existing function in the
library.
A few remarks about the implementation:
1) I have added a new option to configure.py and a new configuration variable,
HAVE_LIBRARIES. It might not be ideal for CURAND, but I am also thinking about
the additional libraries that are available in CUDA 4.0 in
cudatoolssdk_4.0_linux_64.run. I believe it might be possible to include
CURAND in the PyCUDA core (and hence add a dependency on libcurand), but I
also believe that a dependency on e.g. libcupti should be optional. In that
case I would propose changing the name from HAVE_LIBRARIES to HAVE_TOOLS and
changing the patch not to surround CURAND-related code with ifdefs.
2) I have added an enum, direction_vector_set; currently it has only one
value, but it will have more in CUDA 4.0, so please leave it as is.
3) Although it is possible to call the function get_direction_vectors32 and to
create a Sobol32* object, there is no code yet that joins the two. I am still
deciding whether that code should be in C or in Python, so for now please just
apply the patch to the curand branch, and once I have a good implementation I
will send it to the list.
4) Sorry - no documentation for the C code yet.
Thanks.
Best regards.