Dear all,
we are trying to implement a k-nearest-neighbor search on GPUs with
PyOpenCL. The goal of the algorithm is: for a given target point, find
the k nearest points from a given set (training data). The distance
between two points is computed as the squared Euclidean distance.
One of our implementations is a brute-force approach, which aims at
processing big data sets in parallel, e.g. 1 million training points and
some millions of targets (test data). For every target point, one kernel
instance is created which finds the k nearest points among the
training points.
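To make the setup concrete, here is a small, simplified sketch of the kind of
host code and kernel we use (names and sizes are illustrative only, this is not
the attached code, and for brevity it only keeps the single nearest neighbor):

import numpy as np
import pyopencl as cl

# Simplified sketch (k = 1): one work-item per target point,
# brute-force scan over all training points.
KERNEL_SRC = """
__kernel void find_knn(__global const float *train,    /* n_train x dim   */
                       __global const float *targets,  /* n_targets x dim */
                       __global float *min_dist,       /* n_targets       */
                       const int n_train,
                       const int dim)
{
    int gid = get_global_id(0);
    float best = INFINITY;
    for (int i = 0; i < n_train; i++) {
        float d = 0.0f;
        for (int j = 0; j < dim; j++) {
            float diff = train[i * dim + j] - targets[gid * dim + j];
            d += diff * diff;  /* squared Euclidean distance */
        }
        best = fmin(best, d);
    }
    min_dist[gid] = best;
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Illustrative sizes; the real data sets are around 1 million points each.
n_train, n_targets, dim = 100000, 100000, 3
train = np.random.rand(n_train, dim).astype(np.float32)
targets = np.random.rand(n_targets, dim).astype(np.float32)
min_dist = np.empty(n_targets, dtype=np.float32)

mf = cl.mem_flags
d_train = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=train)
d_targets = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=targets)
d_min = cl.Buffer(ctx, mf.WRITE_ONLY, min_dist.nbytes)

prg = cl.Program(ctx, KERNEL_SRC).build()
prg.find_knn(queue, (n_targets,), None, d_train, d_targets, d_min,
             np.int32(n_train), np.int32(dim))
cl.enqueue_copy(queue, min_dist, d_min).wait()  # this is the copy that fails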
Our problem is the following. Everything works fine for small data sets,
and the results are as expected on both the GPU (GeForce GTX 650 with
NVIDIA driver 313.09) and the CPU (Intel Core i5-3450 with AMD APP SDK),
running Ubuntu 12.10 and PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> pyopencl-2013.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py",
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a minimal working example, which consists of one
host Python file and one OpenCL kernel file.
We would highly appreciate any help; maybe we made a
mistake that is already known to you.
So the big question for us is: why does it work on the CPU and why
doesn't it work on the GPU?
Are there NVIDIA-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem[0], 4 bytes cmem[3]
Or are there any rules for using a kernel on big data sets, such as setting
the work group sizes or limiting memory usage?
The error message "invalid command queue" is confusing, and I wasn't able
to find any helpful information (except that oftentimes "invalid command
queue" means a segfault, but I could not find any wrong array address yet).
Maybe one of you could have a look at our code and spot some stupid
mistake.
We would be very grateful for any hint.
Best regards,
Justin Heinermann,
University Oldenburg
Dear Python/OpenCL community,
I am pretty new to (Py)OpenCL and encountered a problem. Maybe it is a lack of understanding of OpenCL on my part, but I found strange Python seg-faults:
test program:
#!/usr/bin/python
import numpy, pyopencl

ctx = pyopencl.create_some_context()
# 1024x1024 single-channel float32 array
data = numpy.random.random((1024, 1024)).astype(numpy.float32)
# segfaults here on Debian sid with pyopencl 2012.1
img = pyopencl.image_from_array(ctx, ary=data, mode="r", norm_int=False, num_channels=1)
print img
System: Debian sid, PyOpenCL 2012.1 (the same code works with Debian stable and v2011.2).
Here is the backtrace obtained with GDB:
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff340c253 in pyopencl::create_image_from_desc(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#2 0x00007ffff342de36 in _object* boost::python::detail::invoke<boost::python::detail::install_holder<pyopencl::image*>, pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>, boost::python::arg_from_python<unsigned long>, boost::python::arg_from_python<_cl_image_format const&>, boost::python::arg_from_python<_cl_image_desc&>, boost::python::arg_from_python<boost::python::api::object> >(boost::python::detail::invoke_tag_<false, false>, boost::python::detail::install_holder<pyopencl::image*> const&, pyopencl::image* (*&)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>&, boost::python::arg_from_python<unsigned long>&, boost::python::arg_from_python<_cl_image_format const&>&, boost::python::arg_from_python<_cl_image_desc&>&, boost::python::arg_from_python<boost::python::api::object>&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#3 0x00007ffff342e06f in boost::python::detail::caller_arity<5u>::impl<pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::detail::constructor_policy<boost::python::default_call_policies>, boost::mpl::vector6<pyopencl::image*, pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object> >::operator()(_object*, _object*) ()
from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#4 0x00007ffff311715b in boost::python::objects::function::call(_object*, _object*) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#5 0x00007ffff3117378 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
#6 0x00007ffff3120593 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#7 0x00007ffff3445983 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<pyopencl::error, void (*)(pyopencl::error const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(pyopencl::error const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#8 0x00007ffff3120373 in boost::python::handle_exception_impl(boost::function0<void>) ()
from /usr/lib/libboost_python-py27.so.1.49.0
#9 0x00007ffff3115635 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
Thanks for your help.
If you are not able to reproduce this bug, I should report it to Debian.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
Hello,
Here is the output of the two scripts (Python 2.7.5) on the HD 6770M device:
$ ipython t_reduce.py
Choose device(s):
[0] <pyopencl.Device 'Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz' on 'Apple' at 0xffffffff>
[1] <pyopencl.Device 'ATI Radeon HD 6770M' on 'Apple' at 0x1021b00>
Choice, comma-separated [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[(3L, (6L,), 9L)]
$ ipython t_cbrng.py
Choose device(s):
[0] <pyopencl.Device 'Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz' on 'Apple' at 0xffffffff>
[1] <pyopencl.Device 'ATI Radeon HD 6770M' on 'Apple' at 0x1021b00>
Choice, comma-separated [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
/usr/local/lib/python2.7/site-packages/pyopencl/__init__.py:57: CompilerWarning: Built kernel retrieved from cache. Original from-source build had warnings:
Build on <pyopencl.Device 'ATI Radeon HD 6770M' on 'Apple' at 0x1021b00> succeeded, but said:
<program source>:517:30: warning: unused variable 'next_ctr'
_module0_Counter next_ctr = _module0_get_next_unused_counter(st);
^
warn(text, CompilerWarning)
(-2.0032926249999998, 9.9931568585707691)
Best regards,
Jean-Matthieu
On 25 Nov 2013, at 04:58, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hi Pongsak,
>
> Thank you! Could you please run the second script as well
> (t_cbrng.py)? It seems that whatever the bug in reduction was, Apple
> got it fixed... maybe it's time to upgrade then.
>
> Best regards,
> Bogdan
>
> On Mon, Nov 25, 2013 at 10:29 AM, Pongsak Suvanpong <psksvp(a)gmail.com> wrote:
>> Hello
>>
>> this is the output
>>
>> psksvp@abydos:~/Workspace$ python3 t_reduce.py
>> Choose device(s):
>> [0] <pyopencl.Device 'Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz' on 'Apple' at 0xffffffff>
>> [1] <pyopencl.Device 'HD Graphics 4000' on 'Apple' at 0x1024400>
>> Choice, comma-separated [0]:1
>> Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
>> [(3, (6,), 9)]
>>
>>
>> My device is the HD4000 GPU.
>>
>> psksvp
>>
>> On 25 Nov 2013, at 12:47 am, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
>>
>>> Hello guys,
>>>
>>> If it is not too much trouble, could someone with Mavericks run these
>>> two scripts (with a GPU device) and tell me the output? I'm getting
>>> strange results on 10.8, and the expected behavior on Linux+Tesla, and
>>> I've read that OpenCL in 10.9 got updated.
>>>
>>> In case you are interested what they do:
>>> - t_reduce.py runs a reduction of an array of 3 elements of a nested
>>> struct dtype. For some reason, if I initialize the zero element as {0,
>>> {0}, 0}, I get wrong results, and if I write 0 separately to each of
>>> its fields, the results are correct.
>>> - t_cbrng.py uses a counter-based RNG to generate 2M normally
>>> distributed floats. If I compile it with '-cl-fast-relaxed-math'
>>> option, the mean&std are correct (-2 and 10), and if I compile it with
>>> default options, both mean and std are off.
>>>
>>> Thank you in advance.
>>>
>>> Best regards,
>>> Bogdan
>>> <t_cbrng.py><t_reduce.py>
Hello
this is the output
psksvp@abydos:~/Workspace$ python3 t_reduce.py
Choose device(s):
[0] <pyopencl.Device 'Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz' on 'Apple' at 0xffffffff>
[1] <pyopencl.Device 'HD Graphics 4000' on 'Apple' at 0x1024400>
Choice, comma-separated [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[(3, (6,), 9)]
My device is the HD4000 GPU.
psksvp
On 25 Nov 2013, at 12:47 am, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hello guys,
>
> If it is not too much trouble, could someone with Mavericks run these
> two scripts (with a GPU device) and tell me the output? I'm getting
> strange results on 10.8, and the expected behavior on Linux+Tesla, and
> I've read that OpenCL in 10.9 got updated.
>
> In case you are interested what they do:
> - t_reduce.py runs a reduction of an array of 3 elements of a nested
> struct dtype. For some reason, if I initialize the zero element as {0,
> {0}, 0}, I get wrong results, and if I write 0 separately to each of
> its fields, the results are correct.
> - t_cbrng.py uses a counter-based RNG to generate 2M normally
> distributed floats. If I compile it with '-cl-fast-relaxed-math'
> option, the mean&std are correct (-2 and 10), and if I compile it with
> default options, both mean and std are off.
>
> Thank you in advance.
>
> Best regards,
> Bogdan
> <t_cbrng.py><t_reduce.py>
Hello guys,
If it is not too much trouble, could someone with Mavericks run these
two scripts (with a GPU device) and tell me the output? I'm getting
strange results on 10.8, and the expected behavior on Linux+Tesla, and
I've read that OpenCL in 10.9 got updated.
In case you are interested what they do:
- t_reduce.py runs a reduction of an array of 3 elements of a nested
struct dtype (a rough illustration of such a dtype follows below). For
some reason, if I initialize the zero element as {0, {0}, 0}, I get wrong
results, and if I write 0 separately to each of its fields, the results
are correct.
- t_cbrng.py uses a counter-based RNG to generate 2M normally
distributed floats. If I compile it with '-cl-fast-relaxed-math'
option, the mean&std are correct (-2 and 10), and if I compile it with
default options, both mean and std are off.
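(Just to illustrate the shape of the data, not the attached script: in numpy,
a nested struct dtype of this kind and the two initialization styles could look
roughly like this; the field names are made up.)

import numpy as np

# Hypothetical nested struct dtype with the same shape as the result shown
# above, (int, (int,), int); the field names are invented.
inner = np.dtype([("y", np.int64)])
outer = np.dtype([("a", np.int64), ("b", inner), ("c", np.int64)])

data = np.empty(3, dtype=outer)

# Variant 1: initialize each element as a whole record ({0, {0}, 0} style).
for i in range(len(data)):
    data[i] = (0, (0,), 0)

# Variant 2: write 0 separately to each field.
data["a"] = 0
data["b"]["y"] = 0
data["c"] = 0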
Thank you in advance.
Best regards,
Bogdan
Hello
The following questions are related to hybrid programming on a multi-device machine: 2 multi-core CPUs with 2 GPUs, MPI on the CPUs and OpenCL on the GPUs. The 2 GPUs are identical NVIDIA K20m cards, and both GPUs are present in the platform's device list.
- From the host side, is there a way to differentiate these GPUs (all information returned via 'device_info' is identical)? Is PyOpenCL able to get some PCIe bus ID information, as PyCUDA does, for example? (See the sketch after these questions.)
- An easy way to differentiate these GPUs would be the list of devices returned by 'platform.get_devices'. How is this list built (is it an ordered list, and do all host processes see an equivalent list in the same order, …)?
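For what it is worth, here is a small sketch of what I tried on the host side
(the pci_bus_id_nv / pci_slot_id_nv attributes are an assumption on my part;
they come from the cl_nv_device_attribute_query extension and may not be
exposed by every PyOpenCL/driver combination):

import pyopencl as cl

platform = cl.get_platforms()[0]
for dev in platform.get_devices(device_type=cl.device_type.GPU):
    try:
        # Assumption: the NVIDIA driver exposes cl_nv_device_attribute_query,
        # which PyOpenCL maps to pci_bus_id_nv / pci_slot_id_nv.
        print("%s: PCIe bus %d, slot %d"
              % (dev.name, dev.pci_bus_id_nv, dev.pci_slot_id_nv))
    except (cl.LogicError, AttributeError):
        # Fall back to the enumeration order of platform.get_devices().
        print("%s: no PCIe information available" % dev.name)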
Thanks a lot.
--
Jean-Matthieu Etancelin
PhD Student
Laboratoire Jean Kuntzmann
Université de Grenoble-Alpes
France
Greetings,
I'm trying to use pyopencl enqueue_fill_buffer but get the following error:
File "/usr/local/lib/python2.7/dist-packages/pyopencl/__init__.py", line 1169, in enqueue_fill_buffer
return _cl.enqueue_fill_buffer(queue, mem, pattern, offset,
AttributeError: 'module' object has no attribute 'enqueue_fill_buffer'
I believe pyopencl/__init__.py line 1169 should be: 'return _cl._enqueue_fill_buffer(...' instead of 'return _cl.enqueue_fill_buffer(...'
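As a stopgap until this is fixed, I am using a small monkey-patch sketch along
these lines (it assumes the underlying wrapper really is called
_enqueue_fill_buffer, as suggested above, and an OpenCL 1.2 platform):

import numpy as np
import pyopencl as cl

# Workaround sketch: alias the C-level wrapper under the name that
# pyopencl/__init__.py expects (assumes it is really _enqueue_fill_buffer).
if not hasattr(cl._cl, "enqueue_fill_buffer"):
    cl._cl.enqueue_fill_buffer = cl._cl._enqueue_fill_buffer

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=1024 * 4)

# Fill the whole buffer with the 32-bit float pattern 0.0.
cl.enqueue_fill_buffer(queue, buf, np.float32(0), 0, 1024 * 4)
queue.finish()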
Cheers,
Martin
Hello,
Just a small message to tell you that the "beignet" OpenCL driver was
released a bit earlier this week (version 0.3).
This driver uses the GPU integrated in the last two generations of Intel processors.
While I was not able to compile it myself, the Debian team made a package which works. Thanks to them.
In [1]: import pyopencl
In [2]: ctx = pyopencl.create_some_context()
Choose platform:
[0] <pyopencl.Platform 'Intel(R) OpenCL' at 0x259dfc0>
[1] <pyopencl.Platform 'Experiment Intel Gen OCL Driver' at 0x7fee1dc2e020>
[2] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7fee199df520>
Choice [0]:1
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
In [3]: queue = pyopencl.CommandQueue(ctx)
In [4]: import pyopencl.array, scipy.misc
In [9]: lgpu=pyopencl.array.to_device(queue, scipy.misc.lena().astype("float32"))
In [10]: inv_lena=255.0-lgpu
In [13]: ilena=inv_lena.get()
In [14]: ilena==255-scipy.misc.lena()
Out[14]:
array([[ True, True, True, ..., False, False, False],
[ True, True, True, ..., False, False, False],
[ True, True, True, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]], dtype=bool)
It is not (yet) perfect, but it is starting to be usable.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF