I finally have the time to contribute something to compyte, so I had a
look at its sources. As far as I understand, at the moment it has:
- sources for GPU platform-dependent memory operations (malloc()/free()/...)
- sources for array class, which uses abstract API of these operations
- some high-level Python code like scan.py with generalized kernels
So I have a few questions about this layout:
1. It does not have its own setup script; is it supposed to be a part
of PyCuda/PyOpenCL and get compiled with them, or is it meant to stand
alone? If the former, a second question:
2. Why was it decided to keep the low-level memory operations in compyte?
They require platform-specific makefiles (and the one currently
committed to the repo is quite machine-specific and belongs to Frederic,
judging from the paths inside). The only reason I can see is to keep
the memory-operations API inside a single module, but in that case we
would have to copy the specialized build code from the setup scripts of
PyCuda/PyOpenCL, which, I think, is a more serious violation of DRY.
The memory API is small and unlikely to change much; we could create
separate modules in PyCuda/PyOpenCL and pass pointers to the memory
functions to compyte using capsules.
3. Moreover, we could export a simple memory API from each of
PyCuda/PyOpenCL (something like an opaque Buffer object plus memory
functions that use it, as is already done in PyOpenCL) for people who
want fine control and do not want to use our general ndarray-like
object. In fact, compyte developers are such people too. There may be
problems, of course, if you are inclined to write the ndarray module
in C (is that really necessary?), but they are solvable.
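For illustration, the kind of thin opaque wrapper I have in mind (names are mine, chosen to echo PyOpenCL's real pyopencl.Buffer, but this is not an existing API):

```python
class Buffer:
    """Opaque device-memory handle: users see a size and a context,
    never the raw platform pointer. Illustrative sketch only."""
    def __init__(self, context, size, _handle):
        self.context = context
        self.size = size
        self._handle = _handle  # raw allocation, kept private

def buffer_zeros(context, size):
    # Stand-in allocation; a real backend would call cudaMalloc or
    # clCreateBuffer here and store the returned device pointer.
    return Buffer(context, size, bytearray(size))

buf = buffer_zeros("ctx0", 64)
assert buf.size == 64
```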
Hope this makes sense. In any case, at the moment I am mostly
interested in the answer to the first question, because it will remove
some uncertainty in my current understanding.
On Sun, 19 Jun 2011 20:30:01 +0100, Irwin Zaid <irwin.zaid(a)physics.ox.ac.uk> wrote:
> Hi all,
> The following code fails using the current source of pycuda.
> import numpy
> import pycuda.autoinit
> import pycuda.gpuarray
> a_gpu = pycuda.gpuarray.zeros(10, dtype = numpy.int32)
> The problem is that pycuda.tools.dtype_to_ctype returns "short unsigned
> int" for numpy.uint16, whereas pycuda.tools.parse_c_arg checks only for
> "unsigned short" and "unsigned short int". This is easily fixable by
> adding a check for "short unsigned int" to parse_c_arg.
Thanks for the report. Fixed, in a few more variants, in both PyCUDA and
PyOpenCL.
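The underlying issue is that C integer type names are order-insensitive: "short unsigned int" and "unsigned short int" name the same type. A normalization along these lines (my own sketch here, not the actual fix as committed) makes the comparison robust against all spellings:

```python
def normalize_ctype(name):
    """Canonicalize a C integer type name so equivalent spellings
    compare equal: drop a redundant trailing 'int', sort qualifiers."""
    toks = name.split()
    if "int" in toks and len(toks) > 1:
        toks.remove("int")  # 'unsigned short int' == 'unsigned short'
    return " ".join(sorted(toks))

assert normalize_ctype("short unsigned int") == normalize_ctype("unsigned short")
assert normalize_ctype("unsigned short int") == normalize_ctype("short unsigned")
assert normalize_ctype("int") == "int"
```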
On Fri, 17 Jun 2011 21:59:42 +0100, Sebastian Nowozin <nowozin(a)gmail.com> wrote:
> Dear Andreas,
> I am a postdoc researcher in the field of machine learning and
> recently discovered your PyOpenCL package that makes OpenCL so much
> more pleasurable to use!
> One question occurred to me: I tried to search the documentation,
> examples, and wiki for any mention as to the support of the OpenCL
> type float4. It seems from the kernel side this is no problem, and
> your examples use float2. But on the host you allocate these as
> I wonder, what is the correct way from PyOpenCL to allocate say a
> (64,64) array of float4 elements? Can you show me a two line example,
> i.e. a cl.Buffer call, or pyopencl.array call, that does this?
np.zeros((64, 64), pyopencl.array.vec.float4)
pyopencl.array.zeros(queue, (64, 64), pyopencl.array.vec.float4)
should work. (Feature only in git at the moment, new release version due
out next week.)
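For reference, on releases that predate pyopencl.array.vec, a host-side float4 can be approximated with a plain 16-byte structured dtype (the field names below are my choice for illustration, not an official layout):

```python
import numpy as np

# A float4 is four packed 32-bit floats; a structured dtype with the
# same 16-byte layout lets the host build data a float4 kernel can read.
float4 = np.dtype([("x", np.float32), ("y", np.float32),
                   ("z", np.float32), ("w", np.float32)])

a = np.zeros((64, 64), dtype=float4)   # a (64, 64) array of float4
assert a.dtype.itemsize == 16          # 4 floats * 4 bytes each
assert a.nbytes == 64 * 64 * 16
```

The resulting buffer is byte-for-byte what the kernel side expects, so it can be handed to a cl.Buffer with COPY_HOST_PTR as usual.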
> If float4 is not supported on the host side, then is it sufficient to
> simply call the kernel on a correctly-sized block of memory, i.e. a
> 64*64*4 vector of float elements?
> Thank you very much for PyOpenCL, it is very much appreciated!
You're welcome. :) One final comment: Please do send such questions to
the mailing list.
On Wed, 15 Jun 2011 08:54:22 -0600, Amy Frederico <afrederico(a)gmail.com> wrote:
> Hi, Andreas! Thanks for the reply. It didn't seem to work but maybe
> I'm doing something wrong. Here's all the info I have:
> So I created the file in my home dir and here it is:
> [amy@amy ~]$ cat .aksetup-defaults.py
> [amy@amy ~]$
> Then I try easy_install again:
> [amy@amy ~]$ sudo easy_install pyopencl
> install_dir /usr/lib/python2.7/site-packages/
> Searching for pyopencl
> Reading http://pypi.python.org/simple/pyopencl/
> Reading http://mathema.tician.de/software/pyopencl
> Best match: pyopencl 2011.1beta3
> Downloading http://pypi.python.org/packages/source/p/pyopencl/pyopencl-2011.1beta3.ta...
> Processing pyopencl-2011.1beta3.tar.gz
> Running pyopencl-2011.1beta3/setup.py -q bdist_egg --dist-dir
> In file included from src/wrapper/wrap_cl.cpp:1:0:
> src/wrapper/wrap_cl.hpp:20:19: fatal error: CL/cl.h: No such file or directory
> compilation terminated.
> error: Setup script exited with error: command 'gcc' failed with exit status 1
Can you also post the gcc compiler command line (I assume you clipped
that from the output)?
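For what it's worth, the usual cause of that error is a missing OpenCL header path. A ~/.aksetup-defaults.py along these lines tells the build where to look (the paths below are examples for a CUDA-based install; point them wherever your SDK actually put CL/cl.h and libOpenCL):

```python
# ~/.aksetup-defaults.py -- read by PyOpenCL's aksetup build helper.
# Adjust both paths to match your OpenCL SDK's install location.
CL_INC_DIR = ["/usr/local/cuda/include"]  # directory containing CL/cl.h
CL_LIB_DIR = ["/usr/local/cuda/lib64"]    # directory containing libOpenCL.so
CL_LIBNAME = ["OpenCL"]
```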
I have the following code (condensed and renamed from the actual test code):
ctx = cl.Context(dev_type=cl.device_type.GPU)
print orig_array.shape, orig_array.strides
cl_array = numpy.asarray(orig_array, order='F')
cl_array.resize((256, 512, 512))
print cl_array.shape, cl_array.strides
cl_image = cl.Image(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
Which produces the following output:
Traceback (most recent call last):
File ".../test_cl.py", line 485, in test_50_opencl
cl_image = cl.Image(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
LogicError: clCreateImage3D failed: invalid image size
-------------------- >> begin captured stdout << ---------------------
(176, 512, 512) (1048576, 2048, 4)
(256, 512, 512) (4, 1024, 524288)
And I can't figure out what's going wrong. I've played with the
pitches argument to Image, I've tried resizing the array to 512**3,
I've checked the max image dimensions:
<pyopencl.Platform 'Apple' at 0x7fff0000>
<pyopencl.Device 'GeForce GT 330M' at 0x1022600>
And per my reading of the spec:
CL_INVALID_IMAGE_SIZE if image_width, image_height are 0 or if
image_depth <= 1 or if they exceed values specified in
CL_DEVICE_IMAGE3D_MAX_WIDTH, CL_DEVICE_IMAGE3D_MAX_HEIGHT or
CL_DEVICE_IMAGE3D_MAX_DEPTH respectively for all devices in context or
if values specified by image_row_pitch and image_slice_pitch do not
follow rules described in the argument description above.
That should be fine. I'm on a 2010 MacBook Pro running OS X 10.6,
which means OpenCL 1.0.
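One thing I still want to rule out (just a guess from my own stride printouts, not a confirmed diagnosis): after asarray(order='F') the strides run fastest along the first axis, so pitches computed for a C-ordered layout no longer match the buffer. Re-packing C-contiguously before creating the image would remove that variable:

```python
import numpy as np

# The second printout's strides (4, 1024, 524288) are exactly the
# F-contiguous strides of a (256, 512, 512) float32 array, while
# clCreateImage3D's default pitches assume rows packed the C way.
cl_array = np.zeros((256, 512, 512), dtype=np.float32, order="F")
assert cl_array.strides == (4, 1024, 524288)

# Forcing C order makes the layout unambiguous before COPY_HOST_PTR.
host = np.ascontiguousarray(cl_array)
assert host.flags.c_contiguous
assert host.strides == (512 * 512 * 4, 512 * 4, 4)
```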
Hi, I hope this is the right place to post this!
So I have installed the CUDA libraries downloaded from nVidia here:
When I run this:
I get this error: src/wrapper/wrap_cl.hpp:20:19: fatal error: CL/cl.h: No such
file or directory
That file DOES exist and it's here:
I even set the LD_LIBRARY_PATH and it doesn't seem to make a difference.
Any help is appreciated!
Hello, is there an option to make OpenCL portable, without the need to
install the ATI Stream SDK? Let's say I just ship all the drivers with
the package together with PyOpenCL, and PyOpenCL uses them from there?
Is there a way to do it?
I am getting errors trying to access the pyopencl.array.vec class (as
documented at http://documen.tician.de/pyopencl/array.html#vector-types). I am using
Python 2.7.1 (r271:86832, Dec 21 2010, 11:31:35)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyopencl
>>> import pyopencl.array
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'vec'
I couldn't find a reference to a class definition for vec in the egg.
On Sat, 11 Jun 2011 09:15:46 +0200, Patric Holmvall <patric.hol(a)gmail.com> wrote:
> Hi Andreas!
> Thank you for your reply. Well, my main concern is that I have written my
> code wrong, so that it runs with the same device in both cases.
> Even if ATI uses another thread stride, I think that I would still hit the
> sweetspot somewhere if taking each multiple of 32 through 32000.
> Another thing I thought about that I haven't had the time to test yet: could
> it be that OpenCL does a bad job at splitting up the global size on ATI in
> this case, that I need to set local sizes etc manually?
I'd say that's definitely worth a try.
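If you do set local sizes by hand, remember that the global size must be a multiple of the local size in each dimension; a tiny helper for the rounding (my own, not a PyOpenCL API):

```python
def round_up(global_size, local_size):
    """Smallest multiple of local_size that is >= global_size."""
    r = global_size % local_size
    return global_size if r == 0 else global_size + local_size - r

# e.g. 1000 work items with a work-group size of 64:
assert round_up(1000, 64) == 1024
assert round_up(1024, 64) == 1024
```

The kernel then guards the padding itself, e.g. `if (get_global_id(0) >= n) return;` as its first statement.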
On Fri, 10 Jun 2011 09:41:08 +0200, Patric Holmvall <patric.hol(a)gmail.com> wrote:
> I have an AMD processor (Phenom II X4 965 BE) and an ATI graphics card (ATI
> Radeon HD 5850), I have uploaded properties to pastebin. When I set
> each of them as devices separately in my code, they seem to run with pretty
> much the exact same performance in all different kinds of python/pyopencl
> programs. When I benchmarked one of my programs, I got these results:
> The plot points are taken for evenly increased thread size by 32. Note that
> the first maximum of GTX470 is 32*448, which is the warp size times the
> amount of CUDA cores. The speedup is roughly 80 times versus the
> unparallelized equivalent C-code (not so heavily optimized), when measuring
> only execution time.
> Can it really be that I have managed to get a CPU and GPU that performs
> pretty much exactly the same in all programs? It looks like as if the CPU is
> always chosen as device instead of the GPU, as the performance seems a bit
> low for the graphics card. I use the same code as in benchmark_all.py to
> fetch and set my devices:
> for platform in cl.get_platforms():
> for device in platform.get_devices():
> if cl.device_type.to_string(device.type) == "GPU":
> gpu_dev = device
> elif cl.device_type.to_string(device.type) == "CPU":
> cpu_dev = device
> Then I create the context with:
> ctx = cl.create_some_context([cpu_dev])
> ctx = cl.create_some_context([gpu_dev])
> Any thoughts or ideas?
Unless you suspect a bug in the wrapper, this isn't really the right
forum to ask about this.
That said, note that for example the same memory access pattern that
Nvidia recommends (32/64-bit with stride 1 from one work item to the
next) will not work well on 5xxx-level ATI cards.
It's a common fallacy to take CUDA (or Nvidia-tuned CL) code and
expect its performance to carry over to other hardware as-is. CL does
not relieve you of knowing your hardware or tuning for each individual
device. But once you do understand your hardware, it makes getting
good performance across many devices easier.
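To make the access-pattern point concrete, here is a tiny host-side sketch (illustrative only) of the byte addresses that consecutive work-items touch. Stride-1 access keeps neighbouring work-items in one contiguous region, which is what Nvidia's coalescing hardware rewards; on ATI's 5xxx series you generally do better having each work-item read a wider contiguous chunk (e.g. a float4) instead:

```python
def touched_addresses(num_items, stride_elems, elem_size=4):
    """Byte address read by each of num_items consecutive work-items."""
    return [i * stride_elems * elem_size for i in range(num_items)]

# Stride-1: neighbours touch adjacent words -> one coalesced transaction.
assert touched_addresses(4, 1) == [0, 4, 8, 12]

# Stride-16: neighbours sit 64 bytes apart -> many separate transactions.
assert touched_addresses(4, 16) == [0, 64, 128, 192]
```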