Dear Python/OpenCL community,
I am pretty new to (py)opencl and encountered a problem. Maybe it is a lack of understanding of OpenCL on my part, but I found a strange Python segfault:
test program:
#!/usr/bin/python
import numpy, pyopencl
ctx = pyopencl.create_some_context()
data=numpy.random.random((1024,1024)).astype(numpy.float32)
img = pyopencl.image_from_array(ctx, ary=data, mode="r", norm_int=False, num_channels=1)
print img
System: Debian sid, pyopencl 2012.1 (the same code works on Debian stable with v2011.2)
Here is the backtrace obtained with GDB:
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff340c253 in pyopencl::create_image_from_desc(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#2 0x00007ffff342de36 in _object* boost::python::detail::invoke<boost::python::detail::install_holder<pyopencl::image*>, pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>, boost::python::arg_from_python<unsigned long>, boost::python::arg_from_python<_cl_image_format const&>, boost::python::arg_from_python<_cl_image_desc&>, boost::python::arg_from_python<boost::python::api::object> >(boost::python::detail::invoke_tag_<false, false>, boost::python::detail::install_holder<pyopencl::image*> const&, pyopencl::image* (*&)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>&, boost::python::arg_from_python<unsigned long>&, boost::python::arg_from_python<_cl_image_format const&>&, boost::python::arg_from_python<_cl_image_desc&>&, boost::python::arg_from_python<boost::python::api::object>&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#3 0x00007ffff342e06f in boost::python::detail::caller_arity<5u>::impl<pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::detail::constructor_policy<boost::python::default_call_policies>, boost::mpl::vector6<pyopencl::image*, pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object> >::operator()(_object*, _object*) ()
from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#4 0x00007ffff311715b in boost::python::objects::function::call(_object*, _object*) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#5 0x00007ffff3117378 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
#6 0x00007ffff3120593 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#7 0x00007ffff3445983 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<pyopencl::error, void (*)(pyopencl::error const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(pyopencl::error const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#8 0x00007ffff3120373 in boost::python::handle_exception_impl(boost::function0<void>) ()
from /usr/lib/libboost_python-py27.so.1.49.0
#9 0x00007ffff3115635 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
Thanks for your help.
If you are not able to reproduce this bug, I should report it to Debian.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
Dear Andreas,
I am currently working on a cython based wrapper for the OpenCL FFT library from AMD: https://github.com/geggo/gpyfft
For this I need to create a pyopencl Event instance from a cl_event returned by the library. I attached a patch against recent pyopencl that adds this possibility, similar to the from_cl_mem_as_int() method of the MemoryObject class. Could you please add this to pyopencl?
Thanks for your help.
Gregor
Sorry if there are two copies of this message.
I sent it to the list but received no confirmation
(nor any error), and the archive does not show
any message of mine from January.
I can see that there is already a new version (2013.1) in the docs,
marked "in development". I would prefer that it not be released
before the problems with the parallel prefix scan are fixed.
The scan problems are only visible on an APU (Loveland). They do not
occur on ION, nor on a GTX 460. I do not have access to a machine
with NVIDIA CC 3.x, so I cannot test the prefix scan there.
I first encountered these problems in August and mentioned them in an
email to the list from 2012-08-08 ("Python3 test failures").
Only recently have I had the time and energy to look into them more closely.
The tests still fail on the recent git version c31944d1e81a.
The failing tests are now in test_algorithm.py, in the third group (marked
scan-related, starting at line 418). I'll describe my observations
of the test_scan function.
My APU has 2 Compute Units. GenericScanKernel chooses
a k_group_size of 4096, a max_scan_wg_size of 256,
and a max_intervals of 6.
The first error occurs when there is enough work to fill two Compute
Units - in my case 2**12+5 elements. It looks like there is a problem with
passing the partial result from the computation on the first CU to the second
one. The prefix sum is computed correctly on the second half of the array,
but starting from the wrong point. I have printed the interval_results array
and observed that the error (the difference between the correct value
of the interval's first element and the actual one) is not the value
of any of the elements of interval_results, nor is it the difference
between interval_results elements. On the other hand, the difference
between the actual and expected value is similar (i.e. in the same range)
to the difference between interval_results[4] and interval_results[3].
In the test I have just run, the error is 10724571 and
that difference is 10719275; I am not sure whether this is relevant, though.
The errors are not repeatable - sometimes they occur for small arrays
(e.g. for 2**12+5), sometimes for larger ones (the test I have just run
failed for an ExclusiveScan of size 2**24+5). The failures also depend
on the order of the tests - after changing the order of the
elements of the scan_test_counts array I got failures for different
sizes, but always for sizes larger than 2**12. It might be
some race condition, but I do not understand the new scan fully
and cannot point to a single place.
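To make the structure I am describing concrete, here is a simplified NumPy model of an interval-based scan (my own sketch, not pyopencl's actual code); the carry passed from earlier intervals to later ones is the step that seems to go wrong:

```python
import numpy as np

def interval_scan(a, num_intervals):
    """Inclusive scan computed the way an interval-decomposed GPU scan
    does it: local scans plus a per-interval carry."""
    chunks = np.array_split(a, num_intervals)
    # Pass 1: each interval computes its local sum (cf. interval_results).
    interval_results = np.array([c.sum() for c in chunks])
    # Pass 2: the carry for interval i is the sum of all earlier intervals.
    carries = np.concatenate(([0], np.cumsum(interval_results)[:-1]))
    # Pass 3: local scan of each interval, offset by its carry.
    return np.concatenate([np.cumsum(c) + carry
                           for c, carry in zip(chunks, carries)])

a = np.random.randint(0, 100, 2**12 + 5)
assert np.array_equal(interval_scan(a, 6), np.cumsum(a))
```

In my failing runs, the second half of the array looks like it was scanned with the wrong carry applied in pass 3.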
If there is any additional test I can perform, please let me know.
I'll try to investigate further, but I am not sure whether I will succeed.
Best regards.
--
Tomasz Rybak GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
Hi Antonio,
Antonio Rieser <tonyrieser(a)gmail.com> writes:
> I just built and installed pyopencl from source on an old Dell
> Inspiron 1440 running Fedora 18 with the Intel SDK OpenCL
> implementation and Python version 3.3. I believe the installation went
> ok (after a few hiccups), but I now get a "device not available"
> runtime error when I try to run benchmark.py (as well as any other
> example program that does not give me a syntax error), and I don't see
> how to resolve it. The output of "python3.3 benchmark.py" is:
>
> Execution time of test without OpenCL: 0.17273521423339844 s
> ===============================================================
> Platform name: Intel(R) OpenCL
> Platform profile: FULL_PROFILE
> Platform vendor: Intel(R) Corporation
> Platform version: OpenCL 1.2 LINUX
> ---------------------------------------------------------------
> Device name: Intel(R) Core(TM)2 Duo CPU T6600 @ 2.20GHz
> Device type: CPU
> Device memory: 2974 MB
> Device max clock speed: 2200 MHz
> Device compute units: 2
> Device max work group size: 1024
> Device max work item sizes: [1024, 1024, 1024]
> Traceback (most recent call last):
> File "benchmark.py", line 46, in <module>
> ctx = cl.Context([device])
> pyopencl.RuntimeError: Context failed: device not available
Is your CPU new enough to be supported by the Intel CL SDK? If not, try
the AMD one; it is a bit more compatible with old CPUs.
HTH,
Andreas
Is everything OK with the git repositories on git.tiker.net?
When I tried to pull changes, I got:
$ git pull
From http://git.tiker.net/trees/pycuda
+ 2e20f2c...00cf720 master -> origin/master (forced update)
Already up-to-date.
Cloning the PyCUDA repository gives the last change as 2011-12-07:
commit 00cf72000946edc7b6f684748be9a888e3aef20e
Author: Andreas Kloeckner <inform(a)tiker.net>
Date: Wed Dec 7 10:23:20 2011 -0500
Bump version (for CUDA 3.1 fix), release.
git status on the pyopencl repository gives:
$ git status
# On branch master
# Your branch is ahead of 'origin/master' by 544 commits.
#
nothing to commit (working directory clean)
At the same time repositories on github have changes from Feb 2013.
What's going on? As I list git.tiker.net as the master
repository in my packages, should I change it to GitHub?
Best regards.
--
Tomasz Rybak <tomasz.rybak(a)post.pl> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
Hello!
I just built and installed pyopencl from source on an old Dell
Inspiron 1440 running Fedora 18 with the Intel SDK OpenCL
implementation and Python version 3.3. I believe the installation went
ok (after a few hiccups), but I now get a "device not available"
runtime error when I try to run benchmark.py (as well as any other
example program that does not give me a syntax error), and I don't see
how to resolve it. The output of "python3.3 benchmark.py" is:
Execution time of test without OpenCL: 0.17273521423339844 s
===============================================================
Platform name: Intel(R) OpenCL
Platform profile: FULL_PROFILE
Platform vendor: Intel(R) Corporation
Platform version: OpenCL 1.2 LINUX
---------------------------------------------------------------
Device name: Intel(R) Core(TM)2 Duo CPU T6600 @ 2.20GHz
Device type: CPU
Device memory: 2974 MB
Device max clock speed: 2200 MHz
Device compute units: 2
Device max work group size: 1024
Device max work item sizes: [1024, 1024, 1024]
Traceback (most recent call last):
File "benchmark.py", line 46, in <module>
ctx = cl.Context([device])
pyopencl.RuntimeError: Context failed: device not available
Any help is heartily appreciated.
Best regards,
Antonio
Hi all again:
I've created a test to measure performance... You can test it, the code
is here:
http://pastebin.com/Nye5Axm8
I'm using two arrays; changing the size of the first one by only two
elements causes a large performance loss, but only on the GPU.
Time for ASIZE: 29120 [GPU]: 0.296602 s
Time for ASIZE: 29120 [CPU]: 3.12564 s
Time for ASIZE: 29122 [GPU]: 11.2411 s
Time for ASIZE: 29122 [CPU]: 3.13552 s
Why this difference in GPU performance?
I'm using pyopencl 0.92 (the Ubuntu 12.04 version), and my graphics card is a
Radeon HD6450.
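I wonder whether it matters that 29122 is not a multiple of a typical work-group size while 29120 is; here is a quick check (my own sketch - the work-group size of 64 is an assumption, not something I verified for the HD6450):

```python
# Quick check: does the slow size divide evenly into work-groups of 64?
def round_up(n, wg):
    """Round global size n up to the next multiple of work-group size wg."""
    return n if n % wg == 0 else n + wg - (n % wg)

print(29120 % 64)           # 0 -> fills work-groups exactly
print(29122 % 64)           # 2 -> spills into a partial work-group
print(round_up(29122, 64))  # 29184: padded global size (the kernel
                            # must then guard out-of-range work-items)
```

If that is the cause, padding the global size and adding an `if (get_global_id(0) < n)` guard in the kernel might avoid the cliff.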
Many thanks.
Hi:
I was testing with a result array of booleans; the problem is accessing
the correct element by index... What's the correct way of doing this?
For example, with these buffer declarations:
a = numpy.random.randint(-100000000, 100000000, 10).astype(numpy.int32)
r = numpy.zeros(10, numpy.bool)
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY | mf.COPY_HOST_PTR, hostbuf=r)
and this kernel:
__kernel void test(__global const int *a,
                   __global bool *c)
{
    int agid = get_global_id(0);
    if (agid == 1)
        c[agid] = true;
}
The c buffer contains the following values (10 elements, initialized to false):
[False False False False True False False False False False]
i.e. it is using 1 * 4 (int size, 4 bytes) as the index value...
I'm a little confused :-)
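If I understand it correctly, this is a size mismatch: numpy.bool_ elements are 1 byte on the host, while the kernel writes through int-sized elements, so the store for index 1 lands at byte offset 4. (As far as I know, the OpenCL spec does not even allow bool as a kernel argument type; __global uchar * is the usual portable choice.) A small NumPy sketch of the mismatch (my own illustration):

```python
import numpy as np

# Host-side result array: one byte per element.
r = np.zeros(10, np.bool_)
assert r.itemsize == 1

# If the device writes a 4-byte "true" at index 1, the store lands at
# byte offset 1 * 4 = 4 of the same buffer:
r.view(np.uint8)[1 * 4] = 1

print(r)  # element 4, not element 1, now reads as True
```

Declaring the kernel parameter as `__global uchar *c` and the host array as numpy.uint8 should make the element sizes agree on both sides.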
Hello,
Is my question too hard or too trivial? I can't find any examples or info
on the web on this. I'll try to reformulate:
I want to run two different kernel functions in succession, with the same
variables/input. They are both written in the same C file. This is what I
run in the host code:
kernel_1 = prg.function_1
kernelObj_1= kernel_1(queue, globalSize, localSize, ins.data, ranluxcltab)
kernelObj_1.wait()
kernel_2 = prg.function_2
kernelObj_2 = kernel_2(queue, globalSize, localSize, ins.data, ranluxcltab)
kernelObj_2.wait()
Is this correct? If so, I'm running out of memory faster than I expect. Is
the same data really being reused in this way, or is it duplicated?
Cheers, Calle
On Thu, Jan 31, 2013 at 9:11 AM, Calle Snickare <problembarnet(a)gmail.com>wrote:
> Hello again,
> I checked again, and I need to reduce the number of threads by a factor
> of 4 to avoid the "out of memory" error. This seems very strange, since the
> idea is to use the same memory for seeds etc. when running the
> initialization kernel as when running my main kernel. Is there something
> wrong in my kernel invocation?
>
> Cheers, Calle
>
>
> On Mon, Jan 28, 2013 at 3:40 PM, Calle Snickare <problembarnet(a)gmail.com>wrote:
>
>> Hello,
>> I am currently trying to implement Ranlux in one of my programs. My
>> kernel will be re-run several times with the same seeds, so I don't want to
>> include the Ranlux initialization in it as I only want to do this once
>> (right?). I also want to make sure to use the same memory between the runs.
>> So I figure that I solve this by having two kernels: one kernel that
>> initializes Ranlux (run this once at the beginning), as well as my "main"
>> kernel. They will both be written in the same c-file.
>>
>> Here is some of the code. At first I had some strange errors getting it
>> to work. Now I can get it to run, but it feels like it runs out of memory
>> quicker than it should. Am I approaching this the wrong way?
>>
>>
>> Host code:
>> ctx = cl.create_some_context()
>> queueProperties = cl.command_queue_properties.PROFILING_ENABLE
>> queue = cl.CommandQueue(ctx, properties=queueProperties)
>>
>> mf = cl.mem_flags
>> dummyBuffer = np.zeros(nbrOfThreads * 28, dtype=np.uint32)
>> ins = cl.array.to_device(queue, (np.random.randint(0, high = 2 ** 31 - 1,
>> size = (nbrOfThreads))).astype(np.uint32))
>> ranluxcltab = cl.Buffer(ctx, mf.READ_WRITE, size=0, hostbuf=dummyBuffer)
>>
>> kernelCode_r = open(os.path.dirname(__file__) + 'ranlux_test_kernel.c',
>> 'r').read()
>> kernelCode = kernelCode_r % replacements
>>
>> prg = (cl.Program(ctx, kernelCode).build(options=programBuildOptions))
>>
>> kernel_init = prg.ranlux_init_kernel
>> kernelObj_init = kernel_init(queue, globalSize, localSize, ins.data,
>> ranluxcltab)
>>
>> kernelObj_init.wait()
>>
>> kernel = prg.ranlux_test_kernel
>> kernelObj = kernel(queue, globalSize, localSize, ins.data, ranluxcltab)
>> kernelObj.wait()
>>
>> Kernel Code:
>> #pragma OPENCL EXTENSION cl_khr_fp64 : enable
>> #define RANLUXCL_SUPPORT_DOUBLE
>> #include "pyopencl-ranluxcl.cl" // Ranlux source-code
>> #define RANLUXCL_LUX 4
>>
>> __kernel void ranlux_init_kernel(__global uint *ins, __global
>> ranluxcl_state_t *ranluxcltab)
>> {
>> //ranluxclstate stores the state of the generator.
>> ranluxcl_state_t ranluxclstate;
>>
>> ranluxcl_initialization(ins, ranluxcltab);
>> }
>>
>> __kernel void ranlux_test_kernel(__global uint *ins, __global
>> ranluxcl_state_t *ranluxcltab)
>> {
>> uint threadId = get_global_id(0) + get_global_id(1) *
>> get_global_size(0);
>>
>> //ranluxclstate stores the state of the generator.
>> ranluxcl_state_t ranluxclstate;
>>
>> //Download state into ranluxclstate struct.
>> ranluxcl_download_seed(&ranluxclstate, ranluxcltab);
>>
>> double randomnr;
>> randomnr = ranluxcl64(&ranluxclstate);
>> /* DO STUFF */
>>
>>
>> //Upload state again so that we don't get the same
>> //numbers over again the next time we use ranluxcl.
>> ranluxcl_upload_seed(&ranluxclstate, ranluxcltab);
>> }
>>
>>
>> Cheers,
>> Calle
>>
>
>