Dear all,
we are trying to implement a k-nearest-neighbor search on GPUs with
PyOpenCL. The goal of the algorithm is: for a given target point,
find the k nearest points from a given set (training data). The distance
between two points is the squared Euclidean distance.
One of our implementations is a brute-force approach, which aims
at processing big data sets in parallel, e.g. 1 million training points and
several million targets (test data). For every target point, one kernel
instance is created which finds the k nearest points out of the
training points.
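To illustrate the overall pattern, here is a simplified sketch (with made-up sizes, not the attached code; the real kernel selects the k nearest points per work-item instead of writing all distances back):
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n_train, n_test, dim, k = 10000, 1000, 3, 5
train = np.random.rand(n_train, dim).astype(np.float32)
test = np.random.rand(n_test, dim).astype(np.float32)

mf = cl.mem_flags
d_train = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=train)
d_test = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=test)
d_dist = cl.Buffer(ctx, mf.WRITE_ONLY, size=n_test * n_train * 4)

prg = cl.Program(ctx, """
__kernel void all_dists(__global const float *train,
                        __global const float *test,
                        __global float *dist,
                        const int n_train, const int dim)
{
    int gid = get_global_id(0);              /* one work-item per test point */
    for (int j = 0; j < n_train; j++) {
        float d = 0.0f;
        for (int c = 0; c < dim; c++) {
            float diff = test[gid*dim + c] - train[j*dim + c];
            d += diff * diff;                /* squared Euclidean distance */
        }
        dist[gid*n_train + j] = d;
    }
}
""").build()

prg.all_dists(queue, (n_test,), None, d_train, d_test, d_dist,
              np.int32(n_train), np.int32(dim))
dists = np.empty((n_test, n_train), dtype=np.float32)
cl.enqueue_copy(queue, dists, d_dist).wait()
nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest training points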
Our problem is the following: everything works fine for small data sets,
and the results are as expected on both the GPU (GeForce GTX 650 with
NVIDIA driver 313.09) and the CPU (Intel Core i5-3450 with the AMD APP SDK),
running Ubuntu 12.10 and PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> pyopencl-2013.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py",
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a minimal working example, which consists of one
host Python file and one OpenCL kernel file.
We would highly appreciate any help - maybe we made a
mistake which is already known to you.
So the big question for us is: why does it work on the CPU, and why doesn't it
work on the GPU?
Are there NVIDIA-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem[0], 4 bytes cmem[3]
Or are there any rules for using a kernel on big data sets, such as setting
the work group sizes or limiting memory usage?
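For reference, the relevant device limits can be queried like this (minimal sketch):
import pyopencl as cl
ctx = cl.create_some_context()
dev = ctx.devices[0]
print(dev.max_work_group_size)   # upper bound for the local work group size
print(dev.max_mem_alloc_size)    # largest single buffer allocation, in bytes
print(dev.global_mem_size)       # total device memory, in bytes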
The error message "invalid command queue" is confusing, and I wasn't able
to find any helpful information (except that "invalid command queue" often
means a segfault, but I could not find any wrong array address yet).
Maybe one of you could have a look at our code and find some stupid
mistake.
We would be very grateful for every hint.
Best regards,
Justin Heinermann,
University Oldenburg
Dear Python/OpenCL community,
I am pretty new to (py)opencl and encountered a problem. Maybe it is a lack of understanding of OpenCL, but I found strange Python segfaults:
test program:
#!/usr/bin/python
import numpy, pyopencl
ctx = pyopencl.create_some_context()
data=numpy.random.random((1024,1024)).astype(numpy.float32)
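# creating a single-channel, read-only image from the float32 array; this call segfaults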
img = pyopencl.image_from_array(ctx, ary=data, mode="r", norm_int=False, num_channels=1)
print img
System: Debian sid, pyopencl 2012.1 (the same code works with Debian stable and v2011.2).
Here is the backtrace obtained with GDB:
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff340c253 in pyopencl::create_image_from_desc(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#2 0x00007ffff342de36 in _object* boost::python::detail::invoke<boost::python::detail::install_holder<pyopencl::image*>, pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>, boost::python::arg_from_python<unsigned long>, boost::python::arg_from_python<_cl_image_format const&>, boost::python::arg_from_python<_cl_image_desc&>, boost::python::arg_from_python<boost::python::api::object> >(boost::python::detail::invoke_tag_<false, false>, boost::python::detail::install_holder<pyopencl::image*> const&, pyopencl::image* (*&)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>&, boost::python::arg_from_python<unsigned long>&, boost::python::arg_from_python<_cl_image_format const&>&, boost::python::arg_from_python<_cl_image_desc&>&, boost::python::arg_from_python<boost::python::api::object>&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#3 0x00007ffff342e06f in boost::python::detail::caller_arity<5u>::impl<pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::detail::constructor_policy<boost::python::default_call_policies>, boost::mpl::vector6<pyopencl::image*, pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object> >::operator()(_object*, _object*) ()
from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#4 0x00007ffff311715b in boost::python::objects::function::call(_object*, _object*) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#5 0x00007ffff3117378 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
#6 0x00007ffff3120593 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#7 0x00007ffff3445983 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<pyopencl::error, void (*)(pyopencl::error const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(pyopencl::error const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#8 0x00007ffff3120373 in boost::python::handle_exception_impl(boost::function0<void>) ()
from /usr/lib/libboost_python-py27.so.1.49.0
#9 0x00007ffff3115635 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
Thanks for your help.
If you are not able to reproduce this bug, I should mention it to Debian.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
Dear Andreas,
I am currently working on a Cython-based wrapper for the OpenCL FFT library from AMD: https://github.com/geggo/gpyfft
For this I need to create a pyopencl Event instance from a cl_event returned by the library. I have attached a patch against recent pyopencl that adds this possibility, similar to the from_cl_mem_as_int() method of the MemoryObject class. Could you please add this to pyopencl?
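To illustrate the intended usage (the method name below is only a placeholder for whatever the patch ends up calling it):
# Hypothetical usage; "from_cl_event_as_int" is a placeholder name,
# chosen by analogy with MemoryObject.from_cl_mem_as_int().
import pyopencl as cl

def wrap_fft_event(raw_event_handle):
    # raw_event_handle: integer value of the cl_event returned by the AMD FFT library
    return cl.Event.from_cl_event_as_int(raw_event_handle)

# callers can then wait on the FFT like on any other pyopencl event:
# wrap_fft_event(handle).wait()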
Thanks for your help
Gregor
Sorry if there are two copies of this message.
I sent it to the list but received no confirmation
(nor any error), and checked that the archive does not show
any message from January.
I can see that there is already a new version (2013.1) in the docs,
marked "in development". I would like it not to be released
before the problems with the parallel prefix scan are fixed.
The problems with scan are only visible on the APU Loveland. They do not
occur on ION, nor on a GTX 460. I do not have access to a machine
with an NVIDIA CC 3.x device, so I cannot test the prefix scan there.
I first encountered these problems in August and mentioned them in an email
to the list from 2012-08-08 ("Python3 test failures").
Only recently have I had some time and energy to look closer into them.
Tests still fail on the recent git version c31944d1e81a.
The failing tests are now in test_algorithm.py, in the third group (marked
scan-related, starting at line 418). I'll describe my observations
of the test_scan function.
My APU has 2 compute units. GenericScanKernel chooses
k_group_size to be 4096, max_scan_wg_size to be 256,
and max_intervals to be 6.
The first error occurs when there is enough work to fill two compute
units - in my case 2**12+5. It looks like there is a problem with passing
the partial result from the computations on the first CU to the second one.
The prefix sum is computed correctly on the second half of the array, but
it starts from the wrong value. I have printed the interval_results array
and observed that the error (the difference between the correct value
of the interval's first element and the actual one) is not the value
of any of the elements of interval_results, nor is it a difference
between interval_results elements. On the other hand, the difference
between the actual and expected value is similar (i.e. in the same range)
to the difference between interval_results[4] and interval_results[3].
In the test I have just run, the error is 10724571 and
the difference is 10719275; I am not sure whether this is relevant, though.
The errors are not repeatable - sometimes they occur for small arrays
(e.g. for 2**12+5), sometimes for larger ones (the test I ran just now
failed for ExclusiveScan of size 2**24+5). The test failures
also depend on the order of the tests - after changing the order of
the elements of the array scan_test_counts, I got failures for different
sizes, but always for sizes larger than 2**12. It might be
some race condition, but I do not understand the new scan fully
and cannot point my finger at one place.
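In case it helps, a minimal inclusive prefix sum along the lines of the documented GenericScanKernel example (a sketch, not the actual test code) should exercise the same code path:
import numpy as np
import pyopencl as cl
import pyopencl.array
from pyopencl.scan import GenericScanKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

knl = GenericScanKernel(
        ctx, np.int32,
        arguments="__global int *ary",
        input_expr="ary[i]",
        scan_expr="a+b", neutral="0",
        output_statement="ary[i] = item;")   # in-place inclusive scan

n = 2**12 + 5                        # smallest size that fails for me
a = cl.array.arange(queue, n, dtype=np.int32)
ref = np.cumsum(a.get())             # host reference (inclusive prefix sum)
knl(a)
assert (a.get() == ref).all()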
If there is any additional test I can perform, please let me know.
I'll try to investigate it further, but I am not sure whether
I'll succeed.
Best regards.
--
Tomasz Rybak GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
Ariel Guerreiro <arielguerreiro(a)gmail.com> writes:
> I have become interested in heterogeneous computation, especially
> OpenCL and CUDA (and their Python counterparts). However, I have
> noticed the lack of more advanced numerical packages (something
> equivalent to scipy or numpy), or at least I haven't found anything
> that good and user-friendly. Currently I am looking for a way of doing
> singular value decomposition for very large matrices (sparse and not
> sparse, with a focus on not sparse). Do you have any idea where I could
> get something ready-made that works well? I am trying to avoid doing
> the deed myself, for I am not certain of the best algorithm and I am
> not a computer programming wizard, just your plain Python and scipy
> user.
There's currently no such thing for PyOpenCL. scikits.cuda exists for PyCUDA.
If you could be convinced to do some work on this, cooking up a wrapper
for Magma's linear algebra capabilities would likely be a good starting
point:
http://icl.cs.utk.edu/magma/
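For the dense case, scikits.cuda has an SVD backed by CULA; usage is roughly as follows (a sketch, assuming PyCUDA, scikits.cuda and CULA are installed; check the scikits.cuda docs for the exact arguments):
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg

culinalg.init()

a = np.random.rand(1000, 1000).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)
u_gpu, s_gpu, vh_gpu = culinalg.svd(a_gpu)   # dense SVD on the GPU via CULA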
I've cc'd the mailing list. Please send future requests of this nature
directly there. Thanks.
Andreas
Hi Sijiang,
Sijiang Chen <sc30.ucc(a)gmail.com> writes:
> The output of ls /etc/OpenCL/vendors is nvidia.icd. My graphics card is the
> notebook version GT525M. Does that mean all I need to do is download and
> install the Ubuntu version of the CUDA toolkit from NVIDIA, and the problem
> will be solved? Thank you for your help.
I think that might be worth trying.
Andreas
Hi Sijiang,
Sijiang Chen <sc30.ucc(a)gmail.com> writes:
> I had a chat with Antonio Beamud and he helped me set up the pyopencl
> installation on Ubuntu (this installation is based on a clean Ubuntu with
> additional drivers (nvidia) installed).
>
> sudo apt-get install python-pyopencl
> python /usr/share/doc/python-pyopencl/examples/benchmark-all.py
>
> When I type the second command, an error occurs:
>
> Execution time of test without OpenCL: 9.42699384689 s
> Traceback (most recent call last):
> File "/usr/share/doc/python-pyopencl/examples/benchmark-all.py", line 24,
> in <module>
> for platform in cl.get_platforms():
> pyopencl.LogicError: clGetPlatformIDs failed: invalid/unknown error code
>
> Antonio's card is ATI and he had no problem with it; mine is an Nvidia card,
> and the driver was installed via Additional Drivers. Below is the Additional
> Drivers screenshot: [image: Inline images 1]
> By any chance, is there anyone who could help me to solve this? Thanks.
Which OpenCL library did you install? (In Debian, there are
amd-libopencl1, nvidia-libopencl1, and a few more. Not sure how Ubuntu
works in this regard.)
What's the output of
ls /etc/OpenCL/vendors/
?
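Once an ICD file there points at a working OpenCL library, a minimal check like this should list your platform and device (sketch):
import pyopencl as cl
for platform in cl.get_platforms():
    print(platform.name)
    for dev in platform.get_devices():
        print("    " + dev.name)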
Andreas
I had a chat with Antonio Beamud and he helped me set up the pyopencl
installation on Ubuntu (this installation is based on a clean Ubuntu with
additional drivers (nvidia) installed).
sudo apt-get install python-pyopencl
python /usr/share/doc/python-pyopencl/examples/benchmark-all.py
When I type the second command, an error occurs:
Execution time of test without OpenCL: 9.42699384689 s
Traceback (most recent call last):
File "/usr/share/doc/python-pyopencl/examples/benchmark-all.py", line 24,
in <module>
for platform in cl.get_platforms():
pyopencl.LogicError: clGetPlatformIDs failed: invalid/unknown error code
Antonio's card is ATI and he had no problem with it; mine is an Nvidia card,
and the driver was installed via Additional Drivers. Below is the Additional
Drivers screenshot: [image: Inline images 1]
By any chance, is there anyone who could help me to solve this? Thanks.
Regards
Sijiang Chen
Hi Calle,
Calle Snickare <problembarnet(a)gmail.com> writes:
> Is my question too hard or too trivial? I can't find any examples or info
> on the web about this. I'll try to reformulate:
>
> I want to run two different kernel functions in succession, with the same
> variables/input. They are both written in the same C file. This is what I
> run in the host code:
>
> kernel_1 = prg.function_1
> kernelObj_1 = kernel_1(queue, globalSize, localSize, ins.data, ranluxcltab)
> kernelObj_1.wait()
>
> kernel_2 = prg.function_2
> kernelObj_2 = kernel_2(queue, globalSize, localSize, ins.data, ranluxcltab)
> kernelObj_2.wait()
>
> Is this correct? If so, I'm running out of memory faster than I expect. Is
> the same data really being used in this way, or is it duplicated?
Sorry for the long delay in responding. Yes, these arrays should share
memory. In general, it helps people help you if you pass along the error
message and the hardware and software versions (CL implementation, OS,
etc.) that you're running on.
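To illustrate, the pattern you describe, with both kernels operating on the same buffer and nothing copied in between, looks roughly like this (a sketch with made-up kernels, not your code):
import numpy as np
import pyopencl as cl
import pyopencl.array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

ins = cl.array.to_device(queue, np.random.rand(1024).astype(np.float32))

prg = cl.Program(ctx, """
    __kernel void function_1(__global float *a) { a[get_global_id(0)] += 1.0f; }
    __kernel void function_2(__global float *a) { a[get_global_id(0)] *= 2.0f; }
    """).build()

# Both launches receive the same buffer object (ins.data); the data is not
# duplicated on the device between the two calls.
prg.function_1(queue, ins.shape, None, ins.data)
evt = prg.function_2(queue, ins.shape, None, ins.data)
evt.wait()   # an in-order queue already runs the kernels in submission order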
HTH,
Andreas