Dear all,
We are trying to implement a k-nearest-neighbor search on GPUs with
PyOpenCL. The goal of the algorithm is: for a given target point,
find the k nearest points from a given set (the training data). The
distance between two points is the squared Euclidean distance.
One of our implementations is a brute-force approach, which aims
at processing big data sets in parallel, e.g. 1 million training points and
several million targets (test data). For every target point, one kernel
instance is created which finds the k nearest points among the
training points (a reduced sketch of this layout follows below).
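To make the layout concrete without the attachment, here is a heavily
reduced sketch of the idea (not our actual code: k is fixed to 1, the
sizes are made up, and only the kernel name find_knn matches):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

kernel_src = """
__kernel void find_knn(__global const float *train,   // n_train x dim
                       __global const float *query,   // n_query x dim
                       __global float *best_dist,     // n_query (k = 1 here)
                       const int n_train,
                       const int dim)
{
    int gid = get_global_id(0);            // one work-item per query point
    float best = INFINITY;
    for (int i = 0; i < n_train; ++i) {
        float d = 0.0f;
        for (int j = 0; j < dim; ++j) {
            float diff = train[i*dim + j] - query[gid*dim + j];
            d += diff * diff;              // squared Euclidean distance
        }
        best = fmin(best, d);
    }
    best_dist[gid] = best;
}
"""

n_train, n_query, dim = 100000, 100000, 3
train = np.random.rand(n_train, dim).astype(np.float32)
query = np.random.rand(n_query, dim).astype(np.float32)

mf = cl.mem_flags
train_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=train)
query_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=query)
dist_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=n_query * 4)

prg = cl.Program(ctx, kernel_src).build()
prg.find_knn(queue, (n_query,), None, train_buf, query_buf, dist_buf,
             np.int32(n_train), np.int32(dim))

best_dist = np.empty(n_query, dtype=np.float32)
cl.enqueue_copy(queue, best_dist, dist_buf).wait()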
Our problem is the following: everything works fine for small data sets,
and the results are as expected on both the GPU (GeForce GTX 650 with
Nvidia driver 313.09) and the CPU (Intel Core i5-3450 with the AMD APP SDK),
running Ubuntu 12.10 and PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> pyopencl-2013.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py",
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a working minimal example, which consists of one
host Python file and one OpenCL kernel file.
We would highly appreciate any help - maybe we made a
mistake which is already known to you.
So the big question for us is: Why is it working on CPU and why isn't it
working on the GPU?
Are there nVidia-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem[0], 4 bytes cmem[3]
Or are there any rules for using a kernel with big data sets, such as
limits on the work group sizes or the maximum memory usage?
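For reference, the device limits that seemed relevant can be queried
like this (just a sketch of the query, in case those numbers matter):

import pyopencl as cl

ctx = cl.create_some_context()
dev = ctx.devices[0]

# Device limits that typically matter for large launches:
print("max work group size: %d" % dev.max_work_group_size)
print("max mem alloc size : %d bytes" % dev.max_mem_alloc_size)
print("global mem size    : %d bytes" % dev.global_mem_size)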
The error message "invalid command queue" is confusing, and I wasn't able
to find any helpful information (except that oftentimes "invalid command
queue" means a segfault, but I could not find any wrong array address yet).
Maybe one of you could have a look at our code and find some stupid
mistake.
We would be very grateful for every hint.
Best regards,
Justin Heinermann,
University Oldenburg
Dear Python/OpenCL community,
I am pretty new to (py)opencl and encountered a problem; maybe it is a lack of understanding of OpenCL, but I found strange Python seg-faults:
test program:
#!/usr/bin/python
import numpy, pyopencl
ctx = pyopencl.create_some_context()
data=numpy.random.random((1024,1024)).astype(numpy.float32)
img = pyopencl.image_from_array(ctx, ary=data, mode="r", norm_int=False, num_channels=1)
print img
System: Debian sid, pyopencl 2012.1 (the same code works with Debian stable and v2011.2).
Here is the backtrace obtained with GDB:
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff340c253 in pyopencl::create_image_from_desc(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#2 0x00007ffff342de36 in _object* boost::python::detail::invoke<boost::python::detail::install_holder<pyopencl::image*>, pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>, boost::python::arg_from_python<unsigned long>, boost::python::arg_from_python<_cl_image_format const&>, boost::python::arg_from_python<_cl_image_desc&>, boost::python::arg_from_python<boost::python::api::object> >(boost::python::detail::invoke_tag_<false, false>, boost::python::detail::install_holder<pyopencl::image*> const&, pyopencl::image* (*&)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>&, boost::python::arg_from_python<unsigned long>&, boost::python::arg_from_python<_cl_image_format const&>&, boost::python::arg_from_python<_cl_image_desc&>&, boost::python::arg_from_python<boost::python::api::object>&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#3 0x00007ffff342e06f in boost::python::detail::caller_arity<5u>::impl<pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::detail::constructor_policy<boost::python::default_call_policies>, boost::mpl::vector6<pyopencl::image*, pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object> >::operator()(_object*, _object*) ()
from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#4 0x00007ffff311715b in boost::python::objects::function::call(_object*, _object*) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#5 0x00007ffff3117378 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
#6 0x00007ffff3120593 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#7 0x00007ffff3445983 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<pyopencl::error, void (*)(pyopencl::error const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(pyopencl::error const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#8 0x00007ffff3120373 in boost::python::handle_exception_impl(boost::function0<void>) ()
from /usr/lib/libboost_python-py27.so.1.49.0
#9 0x00007ffff3115635 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
Thanks for your help.
If you are not able to reproduce this bug, I should report it to Debian.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
Dear Michael,
first off, please make sure the list stays cc'd so that there is a
permanent record of what we find.
Michael Boulton <michael.boulton(a)bristol.ac.uk> writes:
> I'm not sure what ICD loader it's using, whatever the default one is on
> the systems. Is there a way to find out?
What's the last CL runtime that you installed? (Check the timestamp of
libOpenCL.so.1 to match up with install dates.) Perhaps try 'strings
libOpenCL.so.1'.
> This is a greatly stripped down version of the code that causes the same
> problem: https://gist.github.com/anonymous/6a7441e392167512717a
> The way it's used in the original code is that I have ~200 things to
> run, and using a processing pool I can limit it to only run as many
> threads as there are devices. I then use itertools.cycle to create a
> cycling iterator over the device ids (I use xrange in this example)
> which passes the next free device id to each thread so it knows which
> device id to use (in the real code I'm using a semaphore to make
> absolutely sure they're not being used at the same time, but I don't
> think it's needed?). If I'm doing something really stupid then that
> would be good to know!
I've tried this with CPU-only devices, and it's fine. I believe the
reason this fails with GPUs is that fork() is unsafe once the Nvidia
ICD is initialized. I imagine this happens on the very first CL
call. The initialization probably maps some memory from the GPU into the
process's address space, and it's unclear what it means for two
processes to be fighting over a single map, or whether the map even
survives the fork. I asked Nvidia about this a while back, and their
answer was, "don't do it."
> One other thing I forgot to mention is that I find it a bit confusing
> that platform.get_devices throws an exception when there are no devices
> of the specified type available, when it seems like it would make more
> sense to just return an empty list. Is that just so that it causes some
> kind of explicit error like how clGetDeviceIDs will return
> CL_DEVICE_NOT_FOUND?
Fixed in git, thanks.
Hope that helps,
Andreas
Hi,
I suspect that I'm running into this problem described on
http://documen.tician.de/pyopencl/tools.html:
"The constructor pyopencl.Buffer() can consume a fairly large amount of
processing time if it is invoked very frequently. For example, code
based on pyopencl.array.Array can easily run into this issue because a
fresh memory area is allocated for each intermediate result."
So I wanted to try the recommended memory pools, but I'm a bit lost.
Below is my best guess at the intended pattern; does anybody have a
(preferably working...) example?
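This is what I have pieced together from the docs so far (untested, so
the names and usage may well be off):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
import pyopencl.tools as cl_tools

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# The pool sits on top of an allocator and recycles freed buffers,
# so frequent temporaries don't pay the clCreateBuffer cost each time.
pool = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue))

a = cl_array.to_device(queue, np.random.rand(1024).astype(np.float32),
                       allocator=pool)
b = cl_array.to_device(queue, np.random.rand(1024).astype(np.float32),
                       allocator=pool)

# Intermediate results of the array arithmetic should come from the pool, too.
c = (a + b).get()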
Thanks,
sven
Michael Boulton <michael.boulton(a)bristol.ac.uk> writes:
> On 19/09/13 15:18, Andreas Kloeckner wrote:
>>> The problem is that whenever I get the OpenCL platforms (whether it be
>>> indirectly by doing create_some_context() or by directly calling
>>> get_platforms()) it allocates either 32 or 64 gigabytes of memory
>>> (seemingly at random depending on the system and type of devices). If I
>>> try to delete the platform objects then the memory still stays there, so
>>> it means that whenever I start a run I'm allocating a huge chunk of
>>> memory that I can never deallocate.
>> I don't think those are "real" memory allocations in the sense that they
>> are backed by physical system memory. You're probably seeing them in
>> "top" (or similar). I'm guessing they might be some sort of aperture
>> into which the driver maps GPU memory and various other stuff. Looking
>> at /proc/self/maps from within the process should give you a better idea
>> of what exactly is being mapped.
> I get that it's not actually using the memory, but I've done plenty of
> OpenCL stuff in C/C++ before and I've never seen that behaviour. It's
> harmless, but I was wondering if there was something wrong, and if it was
> related to the other problems I was having.
Huh, weird. PyOpenCL doesn't have any right to behave differently than
OpenCL as used directly from C/C++.
> I get the out of resources error even if I spawn one thread at a time.
> The devices are definitely different in each thread - they're named
> different things, they have a different memory address (I originally
> tried creating a context just with
> cl.Context(dev_type=cl.device_type.GPU) but then it throws a
> "LogicError: Context failed: invalid platform").
>
> For what it's worth, I'm also getting this problem when trying to run
> between a CPU and an Intel Xeon Phi. When I try to run it using
> 2 AMD 7970s, then it creates the context fine but then the thread just
> silently exits without throwing an exception when I try to create a
> command queue with the context.
Can you boil this down to a simple reproducing test case that I could
try on my hardware?
>>> (which I'm also guessing should actually show up as a
>>> pyopencl.RuntimeError"?).
>> Could you check the type of the exception? I don't see how the current
>> code would throw a non-pyopencl exception.
> I worded this badly, I meant that the error shows up as "RuntimeError"
> when all the other pyopencl exceptions show up as
> "pyopencl.<name_of_exception>"
>>> then the command queue will 'become' invalid at some point. Calling
>>> queue.finish() would throw an 'invalid queue' exception, but trying to
>>> launch a kernel using the queue would cause it to just hang silently
>>> and I'd have to kill the process in linux.
>> That's also how (Nvidia) OpenCL "reports" segmentation faults (for
>> instance), i.e. bugs in your code. Are you sure there aren't any bugs in
>> your code that might cause the device to crash?
>>
>> Alternatively, have you looked at the output of 'dmesg' to see if
>> there's anything incriminating? (The messages may look like gibberish,
>> but they might say something important.)
> I checked dmesg on all the platforms I was testing it on:
> - on the one with the AMD GPUs where the thread silently exits, it seems
> to be because of a seg fault
> - nothing new shows up in dmesg when I get the "out of resources" error
> when trying to create a context on 2 different NVIDIA GPUs
> - same as above, but when trying to run across an Intel CPU and an Intel
> Xeon Phi
This is just odd--I've never seen anything like this, but since it's
occurring on wildly different implementations, it shifts the 'blame' away
from those. Another question--what ICD loader are you using?
Andreas
Michael Boulton <michael.boulton(a)bristol.ac.uk> writes:
> I'm in Simon McIntosh-Smith's group at Bristol University and I've been
> using PyOpenCL for a couple of weeks to convert some old Fortran code,
> but I'm having an issue with it and Simon suggested that I talk to you
> directly.
Sure. I've cc'd the list--hope you don't mind.
> The problem is that whenever I get the OpenCL platforms (whether it be
> indirectly by doing create_some_context() or by directly calling
> get_platforms()) it allocates either 32 or 64 gigabytes of memory
> (seemingly at random depending on the system and type of devices). If I
> try to delete the platform objects then the memory still stays there, so
> it means that whenever I start a run I'm allocating a huge chunk of
> memory that I can never deallocate.
I don't think those are "real" memory allocations in the sense that they
are backed by physical system memory. You're probably seeing them in
"top" (or similar). I'm guessing they might be some sort of aperture
into which the driver maps GPU memory and various other stuff. Looking
at /proc/self/maps from within the process should give you a better idea
of what exactly is being mapped.
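Something like this quick sketch (the 1 GiB cutoff is arbitrary) prints
the large mappings from inside the Python process:

# Print all mappings of at least 1 GiB from within the current process.
def dump_large_maps(min_bytes=2**30):
    with open("/proc/self/maps") as f:
        for line in f:
            start, end = (int(x, 16) for x in line.split()[0].split("-"))
            if end - start >= min_bytes:
                print(line.rstrip())

dump_large_maps()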
> I'm trying to do something with multiple threads at the moment where I
> am looking at what devices are available in the main thread, and spawning
> one more thread for each device. If there are 2 GPUs and a CPU on the
> system, this results in it allocating over 200 GB of memory instantly,
> which is obviously not intended. Whenever I try to create a context
> after this happens, it throws a "RuntimeError: Context failed: out
> of resources"
Are you sure you're putting the contexts onto different devices?
Contexts are quite memory-hungry on the device side (on Nvidia).
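For instance, something along these lines (a generic sketch, not your
code) makes the assignment explicit, with one context per device:

import pyopencl as cl

# Enumerate all devices and give each one its own context and queue,
# so no two workers accidentally share a device.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        ctx = cl.Context(devices=[device])
        queue = cl.CommandQueue(ctx, device)
        print(device.name)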
> (which I'm also guessing should actually show up as a
> pyopencl.RuntimeError"?).
Could you check the type of the exception? I don't see how the current
code would throw a non-pyopencl exception.
> Before I was getting the devices like this I was trying to do it another
> way, but I was running into another problem which I think may be related
> to some weird internal python thing. I was initially trying to create a
> context/command queue for each device in the main thread then sending it
> to each spawned thread (I assume it pickles it to do this - I'm not that
> well versed on the internals of python)
No, threads share data and address space directly. No pickles.
> then the command queue will 'become' invalid at some point. Calling
> queue.finish() would throw an 'invalid queue' exception, but trying to
> launch a kernel using the queue would cause it to just hang silently
> and I'd have to kill the process in linux.
That's also how (Nvidia) OpenCL "reports" segmentation faults (for
instance), i.e. bugs in your code. Are you sure there aren't any bugs in
your code that might cause the device to crash?
Alternatively, have you looked at the output of 'dmesg' to see if
there's anything incriminating? (The messages may look like gibberish,
but they might say something important.)
Hope this helps,
Andreas
Hello,
After this mail on Intel IGPs ... I was wondering if anybody had
feedback on an Intel Iris Pro IGP regarding OpenCL programming under
Linux: I need to replace my laptop soon, so shall I go for such a chip or
prefer a discrete Nvidia chip?
Cheers,
--
Jérôme Kieffer
On-Line Data analysis / Software Group
ISDD / ESRF
tel +33 476 882 445
Hi,
I'm playing with the Ranlux RNG on a laptop with an Ivy Bridge CPU and
Intel's OpenCL implementation; the same code runs on the CPU but
fails on the HD 4000 GPU.
This is the error I'm getting when I (interactively) choose the HD 4000:
AssertionError: length of argument type array (4) and CL-generated number of arguments (5) do not agree
Here's the description of the devices I have:
Choose device(s):
[0] <pyopencl.Device 'Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz' on 'Intel(R) OpenCL' at 0x2faf290>
[1] <pyopencl.Device 'Intel(R) HD Graphics 4000' on 'Intel(R) OpenCL' at 0x6008dff0>
The lines that the traceback in ipython is pointing to are (where of
course ctx is the context previously created):
qu1 = cl.CommandQueue(ctx)
ran = cl.clrandom.RanluxGenerator(qu1)
As I said, the same code runs (albeit slowly) on the CPU, if I choose
[0] at the prompt.
Thanks,
Sven
Hi all,
Simon McIntosh-Smith from Bristol University just let me know that he
and Tom Deakin have published a new set of lecture slides and exercises
(with solutions!) for teaching OpenCL in general and PyOpenCL in
particular. I've added a link to these and a few older tutorials to
PyOpenCL's main documentation page:
http://documen.tician.de/pyopencl/#tutorials
Simon requested that if you spot issues with the tutorials, you file
them as issues here:
https://github.com/HandsOnOpenCL/Lecture-Slides
Andreas
Hello,
I am currently in the process of porting a CUDA application (using
PyCUDA) to OpenCL. Part of this process requires me to 'wrap' a
cl.Buffer object such that:
class MyBuf(object):
    def __init__(self, ...):
        # Do some work
        self._buf = cl.Buffer(...)

    # Other methods ...

    # Magic method
    def _something_(self):
        return self._buf._something_
with the _something_ method/property being defined such that one can do:
myb = MyBuf(...)
prg = cl.Program(ctx, ...).build()
prg.mykern(a_queue, ..., myb, ...)
and have it work as expected. When using a ctypes-like library this
can be accomplished by providing an _as_parameter_ property. With
PyCUDA it is sufficient to implement __int__/__long__ and have it
return the desired pointer value. What is the equivalent for pyopencl?
Regards, Freddie.