I'm hoping someone will offer some advice for a newbie. I'm getting the following error:
Traceback (most recent call last):
File "demo_mandelbrot.py", line 161, in <module>
test = Mandelbrot()
File "demo_mandelbrot.py", line 120, in __init__
File "demo_mandelbrot.py", line 147, in create_image
self.draw(-2.13, 0.77, -1.3, 1.3)
File "demo_mandelbrot.py", line 133, in draw
output = calc_fractal(q, maxiter)
File "demo_mandelbrot.py", line 66, in calc_fractal_opencl
File "/Library/Python/2.7/site-packages/pyopencl-2011.1.2-py2.7-macosx-10.7-intel.egg/pyopencl/__init__.py", line 221, in kernel_call
global_offset, wait_for, g_times_l=g_times_l)
pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work group size
To avoid cluttering this email, I've provided a dump of my properties (AMD card) here:
I'm working out of git:
In : pyopencl.version.VERSION
Out: (2011, 1, 2)
and also have:
I did a search here http://search.gmane.org/?query=invalid+work+group+size&group=gmane.comp.p...
but it didn't shed much light, for me anyway.
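For what it's worth, the two conditions that most commonly trigger this error can be checked by hand. The helper below is a hypothetical sketch (not PyOpenCL API) mirroring the usual rules: the local size must evenly divide the global size in every dimension, and its product must not exceed the device's (and the kernel's) maximum work-group size, which PyOpenCL exposes through the device and kernel work-group info queries.

```python
# Hypothetical helper mirroring the rules that usually produce
# "invalid work group size": the local size must divide the global
# size in every dimension, and its product must not exceed the
# device's CL_DEVICE_MAX_WORK_GROUP_SIZE (or the kernel's own limit).

def local_size_ok(global_size, local_size, max_work_group_size):
    """Return True if local_size is a legal work-group size."""
    if len(global_size) != len(local_size):
        return False
    product = 1
    for g, l in zip(global_size, local_size):
        if l <= 0 or g % l != 0:
            return False  # local size must evenly divide global size
        product *= l
    return product <= max_work_group_size
```

On an image whose width or height is not a multiple of the local size (the Mandelbrot demo's dimensions, for example), the divisibility rule is the usual culprit.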
When I run matrix-multiply.py from the examples directory and choose a Tesla
C1060 on my machine I get
The print-out also shows that the result is complete garbage, except for the
first row. If I change the dimensions to
a_width = 4*block_size
a_height = block_size
b_width = block_size
I get correct results.
Is it a known issue with this file?
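As a sketch of how to confirm this kind of breakage: compare the array copied back from the device against a numpy reference. The block_size value below is an assumption, and the device result is a self-contained stand-in, not output from the example.

```python
import numpy as np

# Reference check for a matrix-multiply kernel. block_size = 16 is an
# assumption; a_width/a_height/b_width follow the working configuration
# described above.
block_size = 16
a_width, a_height, b_width = 4 * block_size, block_size, block_size

rng = np.random.default_rng(0)
a = rng.random((a_height, a_width)).astype(np.float32)
b = rng.random((a_width, b_width)).astype(np.float32)

reference = a.dot(b)  # what the kernel should produce

# Substitute the array copied back from the device here; the reference
# itself is used as a stand-in to keep the sketch self-contained.
device_result = reference.copy()

ok = np.allclose(device_result, reference, atol=1e-3)
```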
On Tue, 20 Sep 2011 17:19:50 -0500, Robert L Cloud <rcloud(a)uab.edu> wrote:
> However, even for small domains, where most of everything should fit into
> cache, my program is far slower than an OpenMP program.
Just one more suggestion from my side: Try and do more per work item. It
might be that the AMD implementation has a fairly high setup cost for
each work item, and so having fewer (larger) ones is going to be
beneficial. In my experience, the AMD implementation gives performance
about as good as gcc, while Intel can be significantly better, depending
on what you're trying to do.
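To illustrate "fewer, larger work items": below is a sketch of a kernel in which each work item loops over a chunk of consecutive elements instead of handling one. The kernel name, the axpy operation, and CHUNK = 64 are all made up for the example.

```python
# "Do more per work item": each work item processes CHUNK consecutive
# elements rather than a single one, so per-work-item setup cost is
# amortized. CHUNK and the kernel below are illustrative.
CHUNK = 64

kernel_src = """
__kernel void axpy_chunked(__global const float *x,
                           __global float *y,
                           const float a,
                           const int n)
{
    int start = get_global_id(0) * %(chunk)d;
    int end = min(start + %(chunk)d, n);
    for (int i = start; i < end; ++i)     /* CHUNK elements per item */
        y[i] = a * x[i] + y[i];
}
""" % {"chunk": CHUNK}
```

With this layout the global size shrinks to (n + CHUNK - 1) // CHUNK, so whatever fixed cost the runtime pays per work item is paid far less often.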
I realize that this list is primarily for GPU computing, PyOpenCL being a
descendant of PyCUDA, and I am aware of Vasiliy's work on optimizing matrix
operations and know that he is very well respected in the GPGPU community.
I have done some GPU programming in the past, but in my current work
environment we are using traditional (funny how Beowulf machines are now
considered traditional) clusters with MPI, with little to no shared-memory
programming. I'm trying to make the argument that we should change this and
leverage heterogeneous environments for our codes.
Right now I am primarily concerned with making the case that OpenCL is a
viable option for numerical computing on shared-memory multicore computers,
but the results I am seeing do not support this. This may be due (and
probably is) to my inadequacy as a programmer, or it may be that the current
CPU implementations of OpenCL are not utilizing the full resources of the
machine. Perhaps there is a better forum for discussing OpenCL on the CPU,
but it is still an immature technology and I thought I might find some
insight on this list.
I know that 'top' is not a means to measure the efficiency of a numerical
algorithm, but I am under the impression that it can measure the occupancy
of the CPU. It was troubling to see that the Python process was utilizing
nearly 800% of the processors (on my 8-core Xeon) whereas the PyOpenCL
process was only utilizing 400-500%, with no other significant CPU programs
running concurrently.
I used the same program semantics on both the CPU and GPU. It could be (but
I doubt it) that the poor performance is a result of creating memory buffers
rather than directly using host-allocated memory. However, the copies are
only performed every several hundred iterations and shouldn't be the
bottleneck of the program.
I have only looked at domains of up to 1000 x 1000, and for smaller domains,
up to 500 x 500, my OpenCL program is faster than a sequential Cython
implementation, probably because most of the matrix fits in cache. At larger
sizes, though, the OpenCL runtimes explode.
I've read two books on OpenCL (Gaster's and Munshi's), and at least Gaster's
states that for the CPU you should let the OpenCL driver choose the
appropriate work-group size. However, I am wondering whether, if I am
getting many cache misses, it would be beneficial to break the work groups
up myself.
However, even for small domains, where most of everything should fit into
cache, my program is far slower than an OpenMP program.
Anyway, you will have to pardon any stupidity or unsophistication on my
part, as I am from Alabama :)
Robert L Cloud
Student, School of Engineering
The University of Alabama at Birmingham
You may indeed get better results with the Intel OpenCL CPU
implementation - I got a 3.5x speedup compared to AMD's implementation,
though I suspect that could be due to a more optimized implementation of
the sin/cos functions, which would not help in your case.
As for not finding Intel OpenCL as a platform, I don't think the PyOpenCL
siteconf.py matters - do you have the Intel OpenCL library declared in
/etc/OpenCL/vendors/ ?
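To spell out how that registration works: on Linux the ICD loader discovers platforms through one-line files in /etc/OpenCL/vendors/, each naming a vendor library. A sketch, done in a local demo directory rather than the real path; 'libintelocl.so' is the usual name for Intel's CPU runtime, but verify against your installation.

```python
import os

# The ICD loader reads /etc/OpenCL/vendors/*.icd; each file contains
# just the name of a vendor's OpenCL library. Simulated here in a
# local directory; the library name is an assumption.
os.makedirs("vendors-demo", exist_ok=True)
with open("vendors-demo/intel.icd", "w") as f:
    f.write("libintelocl.so\n")

registered = sorted(os.listdir("vendors-demo"))
```

If get_platforms only shows AMD, the Intel .icd file in the real vendors directory is the first thing to check.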
But as was said previously, your code really is limited by memory
transfers, so you can't expect much acceleration (and on the GPU you really
need coalesced memory transfers - e.g., transferring blocks of 16x16 points
and then working on them).
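The tile-by-tile pattern can be emulated in plain numpy as a sketch; the tile size of 16 and the per-tile operation are illustrative, and on the device the tile would live in local memory rather than a numpy slice.

```python
import numpy as np

# Emulate the "transfer 16x16 tiles, then work on them" pattern:
# process a 2-D array tile by tile so each tile fits in local memory
# (GPU) or cache (CPU). TILE = 16 is an illustrative choice.
TILE = 16

def process_tiled(a, f):
    """Apply f to each TILE x TILE tile of a (shape must be divisible)."""
    h, w = a.shape
    out = np.empty_like(a)
    for i in range(0, h, TILE):
        for j in range(0, w, TILE):
            tile = a[i:i+TILE, j:j+TILE]   # one contiguous block at a time
            out[i:i+TILE, j:j+TILE] = f(tile)
    return out
```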
Also, that "if" in the middle of your kernel is probably going to kill
parallelism, at least on the GPU (the compiler can't assume parallel threads
will take the same path, so execution would be serialized). The way it's
written in Python is much more friendly for parallelization.
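To make the divergence point concrete, here is a numpy sketch of the branch-free alternative: on SIMD hardware both sides of a data-dependent branch may be executed anyway, so computing both and selecting (OpenCL's select(), numpy's where) often beats an "if". The threshold and arithmetic are made up for the example.

```python
import numpy as np

def with_branch(x):
    # Scalar loop with a data-dependent branch, as an "if" in a kernel.
    out = np.empty_like(x)
    for i, v in enumerate(x):
        out[i] = v * 2.0 if v > 0.0 else v * 0.5
    return out

def branch_free(x):
    # Both sides are computed, then the result is selected per element;
    # this maps directly onto OpenCL's select() builtin.
    return np.where(x > 0.0, x * 2.0, x * 0.5)
```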
What domain sizes are you studying in this problem? A 4-point stencil is
memory-bound, so you shouldn't expect to outperform the STREAM benchmark
(calculated using the appropriate reuse ratio, depending on how far you
unroll the kernel).
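A back-of-the-envelope version of that bound, as a sketch: with perfect neighbor reuse each output point costs roughly one float32 read plus one write, and the sustained-bandwidth figure below is a hypothetical STREAM-like number, not a measurement.

```python
# Bandwidth bound for a memory-bound 4-point stencil on a 1000 x 1000
# domain. The 20 GB/s sustained-bandwidth figure is illustrative.
n = 1000 * 1000            # 1000 x 1000 domain
bytes_per_point = 2 * 4    # ~1 float32 read + 1 write, assuming ideal reuse
bandwidth = 20e9           # bytes/s, hypothetical sustained bandwidth

best_case_seconds = n * bytes_per_point / bandwidth   # lower bound per sweep
```

If the measured sweep time is within a small factor of that bound, the kernel is doing about as well as the memory system allows.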
Have you looked at Volkov's work on this problem? They have a very good
CUDA implementation for 3-D stencil operators, a lot of what they say
applies in 2-D: http://www.cs.berkeley.edu/~volkov/volkov10-parcfd.pdf
Also, you can't use 'top' as a reliable measure of computational performance
when analyzing numerical code. You need to work out the number of
floating-point instructions (or memory bandwidth) your CPU or GPU is capable
of per cycle and look at the requirements of your operator.
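Concretely, a measured kernel time converts into an achieved rate like this; the flop count per point and the elapsed time below are illustrative numbers, not measurements.

```python
# Turn a measured sweep time into achieved GFLOP/s - far more telling
# than 'top' occupancy. flops_per_point and elapsed are illustrative.
n = 1000 * 1000
flops_per_point = 4        # e.g. ~3 adds + 1 multiply per stencil point
elapsed = 2.0e-3           # seconds for one sweep (hypothetical)

achieved_gflops = n * flops_per_point / elapsed / 1e9
```

Comparing that number against the device's peak (or the bandwidth-derived bound) tells you how much headroom is actually left.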
On Tue, Sep 20, 2011 at 10:21 PM, Robert L Cloud <rcloud(a)gmail.com> wrote:
> I've done some analysis comparing CPU(on a nehalem) and GPU(on a tesla)
> performance of PyOpenCL to parallel Cython using OpenMP. [...]
I've done some analysis comparing CPU (on a Nehalem) and GPU (on a Tesla)
performance of PyOpenCL to parallel Cython using OpenMP. The performance of
PyOpenCL on the CPU (Intel Nehalem with AMD OpenCL 1.1) was very poor, even
slower than a single-threaded Cython program. I realize that my OpenCL
implementation was fairly poor, but I expected performance to be a bit
better than it was.
The analysis is available here:
I'm hoping that someone can give some insight into how to improve it or why
it is so bad.
Also, I would like to run the analysis again with the Intel OpenCL driver,
but I can't get PyOpenCL to recognize both the Intel and AMD platforms; when
I run get_platforms it only shows AMD. Here is my siteconf.py file:
rcloud@Vertex:~/sources/pyopencl-2011.1.2$ cat siteconf.py
BOOST_INC_DIR = 
BOOST_LIB_DIR = 
BOOST_COMPILER = 'gcc43'
BOOST_PYTHON_LIBNAME = ['boost_python-gcc43-mt']
USE_SHIPPED_BOOST = True
CL_TRACE = False
CL_ENABLE_GL = False
CL_ENABLE_DEVICE_FISSION = True
CL_INC_DIR = ['/home/rcloud/sources/amd/AMD-APP-SDK-v2.5-RC2-lnx64/include']
CL_LIBNAME = ['OpenCL']
CXXFLAGS = 
LDFLAGS = 
thanks in advance,
Robert L Cloud
"Why do you want to distance yourself
from all of us and our opinion?"
I do not write to please you;
you are meant to learn something.
On Thu, 15 Sep 2011 12:52:30 -0400, "Andrew J. Hesford" <andrew.hesford(a)rochester.edu> wrote:
> Dear Dr. Kloeckner,
> I have discovered a bug in your PyOpenCL implementation of 32-bit vector integers. In pyopencl/array.py, on line 62, you list the tuple ('int', np.uint32) instead of the correct ('int', np.int32). As a consequence, attempts to make an int3 vector type actually make a uint3 vector type. Replacing the incorrect tuple with the correct one appears to fix the problem.
Fixed in PyCUDA and PyOpenCL git.
I'm trying to run some demos of PyOpenCL under OSX Snow Leopard (10.6.8) and the demos aren't working for me...
I have tried both the macports installation of py26-pyopencl and have tried building from source and both do the same thing:
160 [examples ] > python demo.py
 <pyopencl.Device 'ATI Radeon HD 5870' at 0x1021b00>
 <pyopencl.Device 'Intel(R) Xeon(R) CPU X5650 @ 2.67GHz' at 0x1020400>
Choice, comma-separated :
Traceback (most recent call last):
File "demo.py", line 9, in <module>
queue = cl.CommandQueue(ctx)
pyopencl.LogicError: CommandQueue failed: invalid value
Does anyone know if there's some extra thing I need to do, or might I be missing something? As far as I can tell, this should work with the OpenCL that comes with the OS X developer tools...