Dear Andreas,
I am currently working on a Cython-based wrapper for the OpenCL FFT library from AMD: https://github.com/geggo/gpyfft
For this I need to create a pyopencl Event instance from a cl_event returned by the library. I have attached a patch against recent PyOpenCL that adds this possibility, similar to the from_cl_mem_as_int() method of the MemoryObject class. Could you please add this to PyOpenCL?
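For illustration only, the intended usage might look roughly like this (the method name below is an assumption, chosen by analogy with from_cl_mem_as_int(); the attached patch may use a different name, and the library call is a placeholder):

    import pyopencl as cl

    # hypothetical sketch: wrap a raw cl_event handle (as a Python int)
    # returned by the wrapped library in a pyopencl Event
    raw_event_int = get_cl_event_from_library()  # placeholder for the gpyfft call
    evt = cl.Event.from_cl_event_as_int(raw_event_int)
    evt.wait()  # usable like any other pyopencl event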
Thanks for your help
Gregor
Hi all,
now that we have a flexible scan, a lot of stuff becomes quite easy:
http://documen.tician.de/pyopencl/array.html#sorting
:)
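For context, a minimal usage sketch, assuming the RadixSort interface described at that link (module path and exact signature may differ in the current git version):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.algorithm import RadixSort  # assumed module path

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # build a radix sorter for a single int32 key array and run it
    sort = RadixSort(ctx, "int *ary", key_expr="ary[i]", sort_arg_names=["ary"])
    ary = cl_array.to_device(
        queue, np.random.randint(0, 1 << 20, 10**6).astype(np.int32))
    (sorted_ary,), evt = sort(ary, queue=queue)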
Performance isn't a dream yet, but I've also done exactly zero
tuning. It manages 34 MKeys/s on Fermi and 42 MKeys/s on Tahiti. For
comparison, numpy does about 10 MKeys/s on a CPU with a decent memory
system. The CL code on the CPU achieves about 10 MKeys/s on 4+ cores,
with the AMD implementation being 50% faster than Intel. (All this is on
32-bit integers.) If you've got some time to help tune this... :P
But the really good news here is that (a) this was pretty easy to put
together on top of the existing scan primitive, and (b) it actually
yields code that works on quite a range of CL implementations.
Hope you're finding this as exciting as I do. :)
Andreas
Hi Bogdan,
Bogdan Opanchuk <mantihor(a)gmail.com> writes:
>> This is a bit mystifying. Where does that config option come from in
>> your case? Can you check your siteconf.py or ~/.aksetup-defaults.py? If
>> it's not coming from there, the default should pass the correct, unsplit
>> (i.e. with comma) version to distutils. You can check this by putting
>>
>> print conf["LDFLAGS"]
>>
>> into setup.py before the call to setup(). If the correct thing is passed
>> to distutils and it still comes out split, then that's a distutils
>> issue. Otherwise, please report back, and we'll investigate.
>
> It seems that the incorrect value is indeed coming from distutils. If
> I remove siteconf.py and just run "setup.py build", the flags are
> correct (and siteconf.py is not created).
Actually, this result means exactly the opposite--it means that
distutils is innocent, and that something in PyOpenCL screws it up. Can
you please run ./configure.py and check what ends up in siteconf.py?
Thanks,
Andreas
Hi there,
J Diviney <justdivs(a)gmail.com> writes:
> I've already posted this to stackoverflow, but then remembered that this
> mailing list exists, so apologies if any of you have read it twice already.
no worries, all good.
> I'm moving a simulation into pyOpenCL and can't get my data access to work.
> I'm trying to supply a 1D array of vectors (well, actually several arrays,
> but the example I've included just uses one).
>
> Currently, the first few vectors are copied over just fine, but then the
> data is simply not what I supplied.
One possible problem that I see in your code is that float3s are
actually laid out as four floats in memory. (You can see this using
sizeof() in your kernel.) When dealing with vectors such as this, it's
often most convenient to use pyopencl.array.vec.float3 as the dtype.
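For example, a minimal sketch (assuming the x/y/z field names of that structured dtype):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # build host data with the padded float3 dtype (4 floats per element),
    # so the host layout matches what the kernel sees for __global float3 *vecs
    n = 16
    host_vecs = np.zeros(n, dtype=cl_array.vec.float3)
    host_vecs["x"] = np.arange(n, dtype=np.float32)
    host_vecs["y"] = 2.0
    host_vecs["z"] = 3.0

    dev_vecs = cl_array.to_device(queue, host_vecs)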
HTH,
Andreas
Hi,
I've already posted this to stackoverflow, but then remembered that this
mailing list exists, so apologies if any of you have read it twice already.
I'm moving a simulation into pyOpenCL and can't get my data access to work.
I'm trying to supply a 1D array of vectors (well, actually several arrays,
but the example I've included just uses one).
Currently, the first few vectors are copied over just fine, but then the
data is simply not what I supplied.
I don't think I've posted here before, so apologies if any of the
formatting/presentation is wrong. Also, I've just stripped out all the
simulation code, so I realise this code is currently not actually doing
anything; I just want to get the buffer passing correct.
Thanks in advance.
I've attached the kernel and the file that I'm calling it with.
Hi all,
if you've started using the new segmented scan implementation, you should make
note of a change that I've just made. Now, your scan_expr is responsible
for implementing segmentation. The scan routine will provide a bool flag
`across_seg_boundary` to indicate whether the scan update is taking
place across a segment boundary. The meaning of the two arguments `a`
and `b` has also been clarified: `a` is the increment, and `b` is the
value being incremented, potentially across a segment boundary (as
indicated by `across_seg_boundary`).
To keep existing code working, simply wrap your scan expression in
across_seg_boundary ? b : (original_scan_expr)
Since this is all still unreleased code, there are no facilities in
place to keep the old usage working. The change became necessary in my
own use of this code, where I wanted to do a segmented scan on some part
of the data and an unsegmented scan on another. This is now easily possible.
This is all also documented here:
http://documen.tician.de/pyopencl/array.html#making-custom-scan-kernels
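For example, a segmented sum-scan under the new convention might look roughly like this (a sketch based on the GenericScanKernel interface in those docs; argument names are taken from there and may have changed since):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.scan import GenericScanKernel

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    # the scan expression itself now handles the segment-boundary flag
    seg_sum = GenericScanKernel(
        ctx, np.int32,
        arguments="__global int *ary, __global char *segflags, __global int *out",
        input_expr="ary[i]",
        scan_expr="across_seg_boundary ? b : (a+b)", neutral="0",
        is_segment_start_expr="segflags[i]",
        output_statement="out[i] = item;")

    ary = cl_array.arange(queue, 16, dtype=np.int32)
    flags = cl_array.zeros(queue, 16, dtype=np.int8)  # nonzero marks a segment start
    out = cl_array.empty_like(ary)
    seg_sum(ary, flags, out, queue=queue)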
Sorry for the incompatible change. I'd also welcome feedback.
Next, if you're writing scan code, this might come in handy:
http://documen.tician.de/pyopencl/array.html#debugging-aids
It's a completely sequential scan kernel generator/runner, best run on a
CPU. It's meant to help isolate concurrency-related bugs from bugs in
the code snippets passed to scan. Its interface is exactly the same as
that of the parallel kernel generator/runner.
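In practice (assuming the class name from those docs) that means you can swap the generator for debugging without changing anything else, e.g. reusing the arguments from the sketch above:

    from pyopencl.scan import GenericDebugScanKernel

    # same constructor signature as GenericScanKernel, but the generated code
    # runs sequentially, which makes concurrency bugs easy to rule out
    dbg_seg_sum = GenericDebugScanKernel(
        ctx, np.int32,
        arguments="__global int *ary, __global char *segflags, __global int *out",
        input_expr="ary[i]",
        scan_expr="across_seg_boundary ? b : (a+b)", neutral="0",
        is_segment_start_expr="segflags[i]",
        output_statement="out[i] = item;")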
Andreas
Hi Andreas,
On Sun, Jul 29, 2012 at 6:25 AM, Andreas Kloeckner
<lists(a)informa.tiker.net> wrote:
> Argh. That function is part of the CL 1.2 spec. Apple seems to be
> advertising 1.2 support and has that function in its CL headers, but
> doesn't actually export it. Good job. :)
>
> Assuming what I just said is true, there's now a workaround for this in git.
> Can you please try it out and report back?
Yes, now it works fine. Thanks!
> This is a bit mystifying. Where does that config option come from in
> your case? Can you check your siteconf.py or ~/.aksetup-defaults.py? If
> it's not coming from there, the default should pass the correct, unsplit
> (i.e. with comma) version to distutils. You can check this by putting
>
> print conf["LDFLAGS"]
>
> into setup.py before the call to setup(). If the correct thing is passed
> to distutils and it still comes out split, then that's a distutils
> issue. Otherwise, please report back, and we'll investigate.
It seems that the incorrect value is indeed coming from distutils. If
I remove siteconf.py and just run "setup.py build", the flags are
correct (and siteconf.py is not created).
Hi all,
Has anyone tried to use PyOpenCL with OS X 10.8? For me the compilation
succeeds, but I'm getting the following error:
Python 2.7.3 (default, Jul 28 2012, 10:31:30)
[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyopencl
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/bogdan/.virtualenvs/py2/lib/python2.7/site-packages/pyopencl/__init__.py",
line 4, in <module>
import pyopencl._cl as _cl
ImportError: dlopen(/Users/bogdan/.virtualenvs/py2/lib/python2.7/site-packages/pyopencl/_cl.so,
2): Symbol not found: _clCreateProgramWithBuiltInKernels
Referenced from:
/Users/bogdan/.virtualenvs/py2/lib/python2.7/site-packages/pyopencl/_cl.so
Expected in: flat namespace
in /Users/bogdan/.virtualenvs/py2/lib/python2.7/site-packages/pyopencl/_cl.so
_cl.so does seem to be linked against the OpenCL framework:
$ otool -L /Users/bogdan/.virtualenvs/py2/lib/python2.7/site-packages/pyopencl/_cl.so
/Users/bogdan/.virtualenvs/py2/lib/python2.7/site-packages/pyopencl/_cl.so:
/System/Library/Frameworks/OpenCL.framework/Versions/A/OpenCL
(compatibility version 1.0.0, current version 1.0.0)
/usr/lib/libstdc++.6.dylib (compatibility version 7.0.0, current
version 56.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 169.3.0)
There is also a minor issue during compilation: clang does not
quite understand the sequence "-Wl -framework OpenCL" and treats
"-Wl" as if it were a warning-related option. This can be fixed by
replacing '-Wl', '-framework', 'OpenCL' with '-Wl,-framework,OpenCL' in
siteconf.py (or, better, somewhere in setup.py), but it does not help
with the symbol problem.
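For reference, the siteconf.py change described above amounts to something like:

    # siteconf.py, before (clang misparses the bare -Wl):
    #   LDFLAGS = ['-Wl', '-framework', 'OpenCL']
    # after (single comma-joined argument that is forwarded to the linker):
    LDFLAGS = ['-Wl,-framework,OpenCL']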
Hi all,
I'm happy to announce that PyOpenCL git finally has a (more) competent
scan implementation. Documentation here:
http://documen.tician.de/pyopencl/array.html#module-pyopencl.scan
Here are some feature bullets for your enjoyment:
- Segmented scan
- Look-behind on input (allows comparisons)
- Look-behind on output (allows exclusive scan, stream compaction)
- Allows transformed scan (aka map-scan)
- Arbitrary argument signatures/many input/output arrays
- Customizable via code snippets
The following operations are already implemented:
- copy_if (i.e. filter/stream compaction)
- remove_if (i.e. filter/stream compaction)
- partition
- unique
Each only took a couple of lines to implement using the generic scan
kernel.
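For instance, copy_if can be used roughly like this (a sketch assuming the pyopencl.algorithm module path and the documented predicate convention; see the docs above for the exact interface):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array
    from pyopencl.algorithm import copy_if  # assumed module path

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    ary = cl_array.to_device(
        queue, np.random.randint(0, 100, 10**6).astype(np.int32))

    # keep only entries greater than 50; the predicate is a C expression
    # evaluated per element, with the element available as ary[i]
    selected, count, evt = copy_if(ary, "ary[i] > 50")
    print(count.get())  # number of selected elements (count lives on the device)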
If you've got feedback (especially before this goes into a release), I'd
be happy to hear it. I'm planning to release this as part of 2012.2,
soon.
Andreas
Hi Frédéric,
On Thu, Jul 19, 2012 at 8:58 AM, Frédéric Bastien <nouiz(a)nouiz.org> wrote:
> How useful is it to abstract between PyCUDA
> and PyOpenCL? Personally, I probably won't use that part, but I do want
> to abstract between CUDA and OpenCL.
It was either that or writing almost identical libraries for PyCUDA
and PyOpenCL. The user is by no means forced to use the cluda context;
it is just a wrapper around a PyCUDA/PyOpenCL context that is passed to
computation classes. After that, the corresponding array classes and
synchronization functions from the target APIs can be used. I
personally use this abstraction because I get better GPU performance
with CUDA, but I also need my programs to run on the CPU at reasonable
speed (mostly for debugging), and for that OpenCL is preferable.
> I like the idea of a code generator that applies transformations to the
> input before doing other computation. This is something I wanted the
> Theano code generator to do, but I never got time to implement it.
> What do the current parameters derive_s_from_lp and derive_lp_from_s mean?
These are two functions that are used to derive types. "s_from_lp"
means "stores from loads and parameters" (transformations can also
have scalar parameters; currently this can be seen in
test_matrixmul.py:test_preprocessing).
In the example, the transformation is connected to the "input pipeline",
i.e. to the input parameter "A". It takes values from the new
external input parameters "A_re" and "A_im" (load=2) and combines them
into "A" (store=1). The derivation functions are used in the following
situations:
1) When the object needs to derive types for the external signature
from its basis parameters, derive_lp_from_s() is called, supplied
with the type of "A", and expected to return types for "A_re" and
"A_im" (and also for the scalar parameters, but there are none here,
so an empty list is returned).
2) When the object needs to derive basis parameters from the arrays
supplied to prepare_for(), derive_s_from_lp() is called, which performs
the derivation in the other direction, producing the data type for "A"
from the data types of "A_re" and "A_im".
There are, in fact, four types of derivation functions. If we had a
transformation connected to the "output pipeline", we would need
derive_l_from_sp() and derive_sp_from_l().
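As a purely hypothetical illustration (these are not the library's actual helpers, just made-up names to show the two directions for the "A_re"/"A_im" and "A" case), the derivation functions could be one-liners of roughly this shape:

    import numpy as np

    # made-up helpers for the illustration
    def complex_for(real_dtype):
        return np.complex64 if np.dtype(real_dtype) == np.float32 else np.complex128

    def real_for(complex_dtype):
        return np.float32 if np.dtype(complex_dtype) == np.complex64 else np.float64

    # given the type of the store "A", return types for the loads
    # "A_re" and "A_im" (and an empty list of scalar parameters)
    derive_lp_from_s = lambda a_dtype: ([real_for(a_dtype)] * 2, [])

    # given the types of the loads "A_re" and "A_im", return the type of "A"
    derive_s_from_lp = lambda re_dtype, im_dtype: complex_for(re_dtype)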
Now that I think about it, "load" and "store" should probably be
called "input" and "output"; that would be less ambiguous :)
> Also, the code section is not something I would call readable... Is it only
> because I never used Mako? Andreas, I think you used Mako, do you find
> this readable?
These are only one-liners, so, yes, they seem fine to me. Perhaps they
would be more readable if I did not try to save on spacing and used
some temporary variables? Also, do you find the kernels (dummy.mako
and matrixmul.mako) hard to read too?
> I'm not sure that forcing people to use Mako is a good idea. Can we do
> without it?
The amount of Mako in the transformations can be reduced if I use
macros, for example, "LOAD1" instead of "${load.l1}". The latter was
chosen because I planned to add some error checking, which would be
less comprehensible with macros. It is much harder to avoid Mako for
non-trivial tasks though, like getting the corresponding complex data
type for a real one, or supporting multiplication of all combinations
of complex/real numbers. Is it really such an issue? It is just Python
code in curly braces, after all.
> I still think that we need to provide the user not just with a common
> GPU ndarray object. We need to also provide functions on it. But I'm
> not sure how we should do this.
My project does not intersect with the "general ndarray" idea at all. If
there were some general array class instead of cl.Array/GPUArray,
it would only make things easier for me. I just think that the numpy
way of providing functions for arrays is not that good in the GPU case,
as the overhead here is more significant and should be made explicit and
manageable for the user.