Hi all,
thanks to the hard work of Marko Bencun and Yichao Yu, the next version
of PyOpenCL will be substantially different internally from the previous
one. In particular, the wrapper will no longer be built using
Boost.Python but instead using cffi 1.0's ahead-of-time mode. One main
consequence of this is that PyOpenCL now works on PyPy.
This new code is now on the git master branch. (It used to live on the
'cffi' branch. The old Boost wrapper is now on the
'deprecated-boost-python' branch.)
From a user's perspective, nothing should have changed--on all machines
I have access to, PyOpenCL passes the same tests as before, on any
Python version more recent than 2.6, including PyPy. Nonetheless, before
I go ahead and release a new PyOpenCL based on this code, I'd like to
get as many of you as I can to try it and report back. If you package
PyOpenCL, or if you have a Mac or a Windows machine, I'd especially like
to hear from you.
Thanks!
Andreas
Hi all,
I'm writing about PyOpenCL's support for complex numbers. As of right
now, PyOpenCL's complex numbers are typedefs of float2 and double2,
which makes complex+complex addition and real*complex multiplication match the
desired semantics, but lots of other operators silently do the wrong
thing, such as
- complex*complex
- real+complex
I've come to regard this as a major flaw; I can't count the number of
times I've had to hunt down bugs caused by it, so I'd like to get rid
of it. I've thought about ways of doing this in a backward-compatible
manner, but they all strike me as flawed, so I'd prefer to move to a
simple struct (supporting both .real and .imag as well as the old .x
and .y members) in one big change.
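To make this concrete, kernel-side code would end up looking roughly like
the sketch below (not the final pyopencl-complex.h; the field and helper
names are just how I would spell them):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

src = """
/* sketch of the struct-based type; the real header would also provide
   double precision, abs, exp, division, and so on */
typedef struct { float real, imag; } cfloat_t;

cfloat_t cfloat_mul(cfloat_t a, cfloat_t b)
{
    cfloat_t r;
    r.real = a.real*b.real - a.imag*b.imag;
    r.imag = a.real*b.imag + a.imag*b.real;
    return r;
}

__kernel void mul_vec(__global const cfloat_t *a,
                      __global const cfloat_t *b,
                      __global cfloat_t *out)
{
    int i = get_global_id(0);
    /* with a struct type, a[i]*b[i] no longer compiles;
       you have to say what you mean */
    out[i] = cfloat_mul(a[i], b[i]);
}
"""
prg = cl.Program(ctx, src).build()

a = cl_array.to_device(queue, np.arange(16).astype(np.complex64))
b = cl_array.to_device(queue, np.full(16, 2+3j, dtype=np.complex64))
out = cl_array.empty_like(a)
prg.mul_vec(queue, (16,), None, a.data, b.data, out.data)
assert np.allclose(out.get(), a.get()*b.get())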
If you have code depending on PyOpenCL's complex number support and are
opposed to this change, please speak up now. I'll make the change in git
today to give you a preview.
What do you think?
Andreas
> On 02.07.2015 at 09:39, Bogdan Opanchuk <bogdan(a)opanchuk.net> wrote:
>
> (accidentally did not CC the mailing list)
>
> Hi Andreas,
>
> I tried to compile & install it on OSX 10.10.4, default clang (from Xcode 6.4) and Pythons 2.7.9, 3.4.3, and pypy-2.6.0 (installed via pyenv). Strangely enough, I have not encountered the problem Gregor reported — PyOpenCL compiles successfully and seems to work fine with my programs.
>
Hi,
as a follow-up to my initial report, I could successfully build recent pyopencl with Python 2.7.6 from python.org (now on OS X 10.10.4), but not with Anaconda Python 2.7. I opened an issue for Anaconda at https://github.com/ContinuumIO/anaconda-issues/issues/373; perhaps someone there knows how to resolve this.
Gregor
Hi All,
First, thank you so much for your hard work on the package.
It is a fantastic resource.
A bit of context:
My lab and I are working on building approximations
for Bayesian nonparametric models implemented in
Python using pyopencl (as far as I can tell, none exist).
So far our package is in its infancy,
but we hope it will be useful to others in about a year.
Problem:
One operation that arises frequently is reducing over
a single axis of a multi-dimensional array
(for example, let's say we have the log probability
calculated for each element of an N x D observation matrix
and want to sum over D to get the log probability of each object
-- more often I have 3-d arrays
and am summing over one axis of them).
Is it possible to use clarray's reduce sum
capabilities to sum over a single axis?
So far I've written my own kernel for a simple 3-d case,
but it's not nearly as robust as the one provided in
clarray.
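For reference, what I currently do looks roughly like the following
(a trimmed-down sketch with placeholder names, shown for the 2-d
N x D example rather than my 3-d case):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n, d = 1000, 32
log_p = cl_array.to_device(queue,
                           np.random.rand(n, d).astype(np.float32))
row_sums = cl_array.empty(queue, n, dtype=np.float32)

prg = cl.Program(ctx, """
__kernel void sum_last_axis(__global const float *a,
                            __global float *out,
                            int d)
{
    /* one work-item per row; each one loops over the D columns */
    int i = get_global_id(0);
    float s = 0.0f;
    for (int j = 0; j < d; ++j)
        s += a[i*d + j];
    out[i] = s;
}
""").build()

prg.sum_last_axis(queue, (n,), None,
                  log_p.data, row_sums.data, np.int32(d))

It works, but it does nothing clever about memory access or work-group
sizes, and it only handles this one layout.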
Thank you in advance,
Joseph Austerweil
Brown University
Assistant Professor of Cognitive, Linguistic, and Psychological Sciences
joseph_austerweil(a)brown.edu
I realize the hardware is of different vintages. I am really concerned with how AMD compares to NVIDIA specifically with global atomic operations. In the initial profiling and research I had done, the culprit appeared to be the global atomic addition, i.e. the conclusion was that AMD is slower than NVIDIA when doing global atomic addition, period.

BUT, crusaderky's comment about the 32 vs 64 threads being held up during operations got me thinking. So I dug into the NVIDIA Profiler and found that about 30% of the threads were idle, and most were idle in a search function. I had originally written this search function using a brute-force method; I have now adjusted it to use a binary search. And I think crusaderky hit the nail on the head and pushed it through the board: the times I get now are a lot faster. The AMD card showed an ~10x speed-up (344 seconds down to 39 seconds) and the NVIDIA card showed an ~5x speed-up (11 seconds down to 2.5 seconds). Since the speed-up is double for the AMD card, I infer it was holding up 64 threads in this search function while the NVIDIA card was holding up only 32. Removing this hold-up is the key, thanks crusaderky! I have yet to do much with the memory access; that is the next task.
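For anyone curious, the change was essentially swapping a linear scan for
something of this shape inside the kernel (an illustrative sketch, not my
actual code):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

prg = cl.Program(ctx, """
/* find, for each query x, the last index i with edges[i] <= x;
   every work-item finishes in about log2(n) steps, so a warp (32) or
   wavefront (64) is no longer held up by a few long linear scans */
int find_last_le(__global const float *edges, int n, float x)
{
    int lo = 0, hi = n;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (edges[mid] <= x)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo - 1;
}

__kernel void locate(__global const float *edges, int n,
                     __global const float *queries,
                     __global int *out)
{
    int i = get_global_id(0);
    out[i] = find_last_le(edges, n, queries[i]);
}
""").build()

edges = cl_array.to_device(queue, np.linspace(0, 1, 65).astype(np.float32))
queries = cl_array.to_device(queue, np.random.rand(1024).astype(np.float32))
out = cl_array.empty(queue, 1024, dtype=np.int32)
prg.locate(queue, (1024,), None,
           edges.data, np.int32(65), queries.data, out.data)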
This does set my mind at ease concerning buying a laptop with an AMD card or an NVIDIA card; either will probably do in the long run as long as I keep my algorithms efficient. ☺
Thanks again
Reese
From: William Shipman [mailto:shipman.william@gmail.com]
Sent: Monday, August 24, 2015 3:10 PM
To: Joe Haywood
Cc: Pyopencl
Subject: Re: [PyOpenCL] Opinions
Just thought I should point out that the FirePro V4800 is 3 years older than the GTX 780 Ti and has far fewer cores. Its bandwidth to global memory is 57.6 GB/s vs the 780 Ti's 336.5 GB/s. Comparing the two is pointless; the FirePro V4800 will always lose.
On 14 August 2015 at 19:12, CRV§ADER//KY <crusaderky(a)gmail.com> wrote:
Look up OpenCL / CUDA coalesced memory access on Stack Overflow; there are plenty of threads there
Just thought I should point out that the FirePro V4800 is 3 years older than
the GTX 780 Ti and has far fewer cores. Its bandwidth to global memory is
57.6 GB/s vs the 780 Ti's 336.5 GB/s. Comparing the two is pointless; the
FirePro V4800 will always lose.
On 14 August 2015 at 19:12, CRV§ADER//KY <crusaderky(a)gmail.com> wrote:
> Look up OpenCL / CUDA coalesced memory access on Stack Overflow; there are
> plenty of threads there
> On 14 Aug 2015 13:55, "Joe Haywood" <haywoojr(a)mercyhealth.com> wrote:
>
>> Will you explain this to me a little more?
>> "One that jumps to the eye is that you're accessing 4 bytes of memory in
>> an arbitrary place, but every time you're really loading up, and then
>> writing back, a whole page! That's why it's so slow, even without atomic
>> operations. The solution is local memory."
>>
>> Sent from my Samsung Galaxy Tab® S
>>
On 2015-08-17 12:57, Eric Hunsberger wrote:
> Does anyone know if concurrent kernels work on (newer) NVIDIA devices
> in OpenCL? If so, can anyone provide some PyOpenCL code that runs a
> minimal working example? As well as perhaps the driver version you're
> using?
>
> For context, "concurrent kernels" just means multiple kernels running
> at the same time. For example, if I have a bunch of kernels, each of
> which only takes up 32 work groups, and my device has a max work group
> size of 1024, then I should ideally be able to run 32 such kernels at
> the same time (in parallel). From what I've read, earlier NVIDIA GPUs
> didn't support this; they added support for up to 16 concurrent
> kernels with the Fermi architecture.
>
> There was a lot of discussion about concurrent kernels four years ago
> or so, based on the threads I've found, and at the time it wasn't
> clear if NVIDIA's OpenCL drivers supported this or not. I still can't
> find a conclusive answer as to whether it should work, and I can't get
> it working in my own code. I've seen several places that multiple
> queues are needed to do this, and even heard that it's necessary to
> flush all the queues, but I still can't get anything to work. NVIDIA
> devices can do this using CUDA:
> http://wiki.tiker.net/PyCuda/Examples/KernelConcurrency.
What version of the NVIDIA driver are you using to try this? How do you
judge whether what you are trying is working or not? Can you share some
code that people can try on their own machines?
My naive perception is that you should just create multiple queues and
submit kernels to them, and things should just work. What happens if you
try and do that?
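In other words, something along these lines (an untested sketch; the
kernel and sizes are made up just to have something small and
long-running to submit):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queues = [cl.CommandQueue(ctx) for _ in range(4)]

# a deliberately small, long-running kernel so that several instances
# could fit on the device at the same time
prg = cl.Program(ctx, """
__kernel void busy(__global float *a)
{
    int i = get_global_id(0);
    float x = a[i];
    for (int k = 0; k < 100000; ++k)
        x = x*0.999f + 1e-6f;
    a[i] = x;
}
""").build()

arrays = [cl_array.zeros(q, 32*64, dtype=np.float32) for q in queues]

# one kernel per queue; flush everything before waiting, so the driver
# has seen all submissions before any of them completes
for q, a in zip(queues, arrays):
    prg.busy(q, (a.size,), (64,), a.data)
for q in queues:
    q.flush()
for q in queues:
    q.finish()

Whether the kernels actually overlap is then something a profiler
timeline would have to show, which is where the driver version starts
to matter.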
Andreas
Does anyone know if concurrent kernels work on (newer) NVIDIA devices in
OpenCL? If so, can anyone provide some PyOpenCL code that runs a minimal
working example? As well as perhaps the driver version you're using?
For context, "concurrent kernels" just means multiple kernels running at
the same time. For example, if I have a bunch of kernels, each of which
only takes up 32 work groups, and my device has a max work group size of
1024, then I should ideally be able to run 32 such kernels at the same time
(in parallel). From what I've read, earlier NVIDIA GPUs didn't support
this; they added support for up to 16 concurrent kernels with the Fermi
architecture.
There was a lot of discussion about concurrent kernels four years ago or
so, based on the threads I've found, and at the time it wasn't clear if
NVIDIA's OpenCL drivers supported this or not. I still can't find a
conclusive answer as to whether it should work, and I can't get it working
in my own code. I've seen several places that multiple queues are needed to
do this, and even heard that it's necessary to flush all the queues, but I
still can't get anything to work. NVIDIA devices can do this using CUDA:
http://wiki.tiker.net/PyCuda/Examples/KernelConcurrency.
Cheers,
Eric
I've noticed that using e.g. clmath._atan2(out, in1, in2, queue) with a
pre-allocated `out` array is nearly twice as fast as using
clmath.atan2(in1, in2, queue), even when a memory pool is used to
allocate the Array.
Consider the (simple) code here:
https://gist.github.com/hgomersall/d7a229df0f816388b63f
It defines the two test cases above inside a function which can be run
inside ipython as follows:
In [1]: from clmath_test import *
In [2]: timeit cl_test()
1000 loops, best of 3: 639 µs per loop
In [3]: timeit cl_test_preallocated()
1000 loops, best of 3: 363 µs per loop
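For anyone who doesn't want to open the gist, the two cases are roughly
of this shape (a condensed sketch, not the exact gist code):

import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
import pyopencl.clmath as clmath
import pyopencl.tools as cl_tools

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
pool = cl_tools.MemoryPool(cl_tools.ImmediateAllocator(queue))

n = 256*1024
in1 = cl_array.to_device(queue, np.random.rand(n).astype(np.float32),
                         allocator=pool)
in2 = cl_array.to_device(queue, np.random.rand(n).astype(np.float32),
                         allocator=pool)
out = cl_array.empty_like(in1)

def cl_test():
    # allocates a fresh output array (from the pool) on every call
    clmath.atan2(in1, in2, queue=queue)
    queue.finish()

def cl_test_preallocated():
    # writes into the pre-allocated output array
    clmath._atan2(out, in1, in2, queue=queue)
    queue.finish()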
Am I missing something here or is this expected behaviour?
Is _atan2 part of the stable API?
(This was on an NVIDIA machine. On my Intel laptop, I seem to run into
this bug:
https://bugs.launchpad.net/ubuntu/+source/pyopencl/+bug/1354086)
Cheers,
Henry
Will you explain this to me a little more?
"One that jumps to the eye is that you're accessing 4 bytes of memory in an arbitrary place, but every time you're really loading up, and then writing back, a whole page! That's why it's so slow, even without atomic operations. The solution is local memory."
Sent from my Samsung Galaxy Tab® S