Hi all, dear Andreas,
on my MacBook Pro (running OS X 10.11) I have been plagued by nasty segfaults when using the NVIDIA 750M GPU and the high-level methods of pyopencl.array together with complex64 arrays. Ultimately this seems to be due to buggy NVIDIA OpenCL drivers on OS X, but I found a workaround:
The typedef in pyopencl/cl/pyopencl-complex.h for cfloat_t (after macro expansion) as a

    union { struct { float x, y; }; struct { float real, imag; }; }

is too sophisticated for the driver. The same happens with a simpler "struct { float x, y; }". But the segfaults go away if I instead use "typedef float2 cfloat_t;" and then replace a.real by a.x (and .imag by .y). All tests pass (I could not test complex double). This is not beautiful, but it works for me.
The docs state that the struct was introduced to avoid silent bugs (e.g. complex + real not giving the expected result), so I understand that my workaround is not acceptable as a PR.
Does anybody know about other ways to avoid the segfaults?
Gregor
"CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
> I second that the caching should be done internally by opencl, as this
> would produce more readable code and avoid obscure bugs like the opener's.
> Is there any reason against it?
>
> As a general python rule, if you have a @property that takes a
> non-negligible amount of time to compute after the first invocation, you're
> doing it wrong and it should be either cached or replaced by a function.
> Just my 2c.
I generally agree, although this is a bit subtler, since "prg.sum" itself is
cheap. It's just that each call of "prg.sum(...)" incurs significant cost.
Nonetheless, yes, I agree that this should be fixed.
https://github.com/pyopencl/pyopencl/issues/112
I'd be happy to take a patch if one of you gets to it before I do.
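In the meantime, the user-side workaround is to hoist the kernel lookup out of the loop (`knl = prg.sum` once, then call `knl(...)` repeatedly). A toy sketch of why that helps, with a stub class standing in for pyopencl.Program (not the real API, just the lookup-cost pattern):

```python
class Program:
    """Stub mimicking pyopencl.Program: every attribute access builds a
    fresh kernel wrapper (the per-call setup cost the issue is about)."""
    def __init__(self):
        self.builds = 0

    def __getattr__(self, name):
        # Called only for attributes not found normally, i.e. kernel names.
        self.builds += 1
        return lambda *args, **kwargs: None  # stand-in for a kernel launch

prg = Program()

# Uncached: pays the wrapper-construction cost on every iteration.
for _ in range(100):
    prg.sum()
uncached_builds = prg.builds

# Cached: look the kernel up once, as with `knl = prg.sum` in real code.
prg.builds = 0
sum_knl = prg.sum
for _ in range(100):
    sum_knl()
cached_builds = prg.builds

print(uncached_builds, cached_builds)  # → 100 1
```

The same hoisting applies verbatim to real pyopencl code: bind `prg.sum` to a local once and launch through that binding inside the hot loop.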
Andreas
Hi,
I’ve been using pyopencl for a while for various simulation/data-processing tasks. I recently upgraded to a new computer and noticed things were considerably slower.
After some experimentation, I tracked this down to the version of pyopencl I was using. The updated version (2015.2.4; most recent on PyPI) takes significantly longer to queue a kernel call (~1.5 ms) than the old version (2015.1, ~0.03 ms). Both times come from the same machine*. Profiling indicates that the newer version makes lots of function calls the old version did not. FYI, the code I used to test this is below (adapted from the documentation).
For my purposes, this is slightly alarming: my code makes lots of kernel calls, in which case the new version is 50x slower for small data sets!
Is this something that has been/will be fixed in newer versions of pyopencl? Is there a workaround? Of course, for the time being I can use the old version, but I’d rather not be stuck with it.
If needed, I can provide the profiler output.
Thanks,
Dustin Kleckner
*OS X w/ Python 3.5 installed via Anaconda, pyopencl installed via pip. Code was tested with GPU and CPU, with similar results.
Test Code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, print_function
import numpy as np
import pyopencl as cl
import time
import cProfile
a_np = np.random.rand(50000).astype(np.float32)
b_np = np.random.rand(50000).astype(np.float32)
# ctx = cl.create_some_context()
device = cl.get_platforms()[0].get_devices(cl.device_type.CPU)[0]
ctx = cl.Context([device])
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_np)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_np)
prg = cl.Program(ctx, """
__kernel void sum(__global const float *a_g, __global const float *b_g, __global float *res_g) {
int gid = get_global_id(0);
res_g[gid] = a_g[gid] + b_g[gid];
}
""").build()
res_g = cl.Buffer(ctx, mf.WRITE_ONLY, a_np.nbytes)
cProfile.run('''
start = time.time()
for n in range(100): prg.sum(queue, a_np.shape, None, a_g, b_g, res_g)
el = time.time() - start
''')
print('cl version:', cl.VERSION)
print('kernel start time: %.3f ms' % (10*el))  # el is seconds for 100 calls: el/100*1e3 = 10*el ms per call
start = time.time()
res_np = np.empty_like(a_np)
cl.enqueue_copy(queue, res_np, res_g)
el = time.time() - start
print('copy time: %.3f ms' % (1E3*el))
On Wed, 3 Feb 2016 10:29:43 -0600
Brian Paterni <bpaterni(a)gmail.com> wrote:
> > One example in the book walks through computing a histogram of an image,
> > but while the example code provided (in C) works and produces the
> > expected output histogram, my attempt to translate the same program into
> > Python raises some questions: the kernel runs and produces results that
> > resemble the reference histogram, only the values are much, much larger
> > than expected.
Are you not mixing int32 and int64?
If you're on linux64, Python works with 64-bit ints by default, while the
OpenCL kernel is likely to treat them as 32-bit only.
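A minimal host-side illustration of that mismatch, in plain NumPy (no OpenCL required; the kernel is assumed to declare its buffer as a 32-bit int pointer):

```python
import numpy as np

# On 64-bit Linux a plain Python int defaults to a 64-bit array dtype;
# made explicit here so the example is platform-independent.
counts = np.array([1, 2, 3], dtype=np.int64)

# A kernel declared with `__global int *` reads 32-bit words, so it sees
# the same buffer as twice as many half-width values:
as_seen_by_kernel = counts.view(np.int32)
print(as_seen_by_kernel)  # [1 0 2 0 3 0] on little-endian machines

# The fix: pin the host dtype to match the kernel signature.
counts32 = np.array([1, 2, 3], dtype=np.int32)
assert as_seen_by_kernel.size == 2 * counts32.size
```

Depending on which side of the mismatch a given buffer is on, this shows up as garbage interleaved with data or as wildly inflated values, which matches the symptom described.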
There's quite a barrier to being able to assist you with this.
Can you reduce the problem to a minimal example of a C implementation
and Python implementation (ideally calling the exact same kernel) that
manifests the problem you're having? It would be best if you just used a
toy kernel to do this that does barely anything.
Experience suggests this will either point you to the correct solution
yourself, or give someone else a much better chance of seeing the
problem and helping you.
Cheers,
Henry
On 03/02/16 16:29, Brian Paterni wrote:
> Re-sending this message, this time while subscribed to the list.
> Hopefully it makes it through this time.
>
> On Fri, Jan 22, 2016 at 3:28 PM, Brian Paterni <bpaterni(a)gmail.com> wrote:
>> Hi,
>>
>> In short, I'm doing some research on OpenCL, and to possibly avoid the
>> verbosity of C or C++, I'm exploring my options with pyopencl. Further,
>> to get up to speed with OpenCL, I'm working through examples in
>> Heterogeneous Computing with OpenCL 2.0.
>>
>> One example in the book walks through computing a histogram of an image,
>> but while the example code provided (in C) works and produces the
>> expected output histogram, my attempt to translate the same program into
>> Python raises some questions: the kernel runs and produces results that
>> resemble the reference histogram, only the values are much, much larger
>> than expected.
>>
>> The code I'm working with is located at
>>
>> https://github.com/bpaterni/4800.research
>>
>> which contains 2 branches. The Python implementations are located in
>> branch 'pyopencl' whereas C implementations are located in master.
>>
>> I'm curious to know if I've misunderstood some aspect of pyopencl during
>> python reimplementation and would very much appreciate any help with
>> this issue I'm having.
>>
>> Thank You :)
>
> _______________________________________________
> PyOpenCL mailing list
> PyOpenCL(a)tiker.net
> http://lists.tiker.net/listinfo/pyopencl
>