In the release notes for Version 2013.1 it says: More/improved Bessel
functions.
In the documentation however I can't find anything on how to create one. Did
I miss something?
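Not something I can confirm from the docs either, but the functions appear to live in pyopencl.clmath as bessel_jn / bessel_yn. A minimal sketch, assuming a working context and that these names match your installed version:

```python
import numpy as np
import pyopencl as cl
import pyopencl.array
import pyopencl.clmath

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

x = cl.array.to_device(queue, np.linspace(0.1, 10.0, 64))

# Bessel function of the first kind, order 0, evaluated elementwise on the device
j0 = cl.clmath.bessel_jn(0, x)
print(j0.get())
```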
Hi there,
I've got a newbie question:
Imagine some cl.array on the device. Now I'd like to overwrite part of it
with another array or a constant. Let's say I want to set half of the array
to 0 or something. How would I do that? array[0:y,0:x]=0. gives me a
non-contiguous error...
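The error comes from the slice's memory layout rather than from PyOpenCL itself: array[0:y, 0:x] selects partial rows, which is not one contiguous run of device memory. A small NumPy sketch (the layout rules are the same for a row-major cl.array) shows the difference:

```python
import numpy as np

a = np.zeros((4, 6), dtype=np.float32)  # row-major, like a cl.array

# A block of whole leading rows is one contiguous run of memory ...
assert a[0:2, :].flags['C_CONTIGUOUS']

# ... but a sub-block of partial rows is strided, hence "non-contiguous":
assert not a[0:2, 0:3].flags['C_CONTIGUOUS']
```

So setting whole leading rows (ary[0:y] = 0) should be assignable in one go, while filling an arbitrary 2-D sub-block needs an elementwise kernel or a fill per row.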
Sorry, I realize I've got a lot of questions...
Here's another one. How can I forcibly release an array if it's no longer
needed? How does the allocator/mem pool decide whether to keep an array or to
release it? I'm writing an application that processes large complex arrays,
and device memory is a serious issue.
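Not from the thread, but for what it's worth: a cl.array's underlying device buffer can be released explicitly, and a memory pool only hands memory back to the driver when asked. A sketch of that pattern, assuming allocations are routed through pyopencl.tools.MemoryPool:

```python
import numpy as np
import pyopencl as cl
import pyopencl.array
import pyopencl.tools

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Route allocations through a pool: freed blocks are held for reuse
# instead of being returned to the driver on every deallocation.
pool = cl.tools.MemoryPool(cl.tools.ImmediateAllocator(queue))

a = cl.array.zeros(queue, (1024, 1024), dtype=np.complex64, allocator=pool)

# Dropping the last reference returns the block to the pool (still held):
del a

# Force the pool to give its held blocks back to the driver:
pool.free_held()

# Without a pool, releasing the underlying Buffer frees device memory directly:
b = cl.array.zeros(queue, 1024, dtype=np.complex64)
b.data.release()
```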
On Tue, 29 Sep 2015 09:51:31 +0100
Blair Azzopardi <blairuk(a)gmail.com> wrote:
> Unfortunately, my Nvidia GPU is a Fermi type and so doesn't support OpenCL
> 1.2. Only Kepler/Maxwell GPUs do. C'est la vie!
Where did you get this information from?
I wonder if Nvidia will ever support 1.2 (they are speaking of switching directly to 2.1)
Cheers,
James <tomo.bbe(a)gmail.com> writes:
> I am looking for advice on how to use Program.compile and/or
> pyopencl.link_program. I had originally thought I could logically divide
> parts of my CL code in to separate Programs/source files that are compiled,
> but not linked immediately. Then when I need to use modules together I can
> pass the list of compiled Program objects to link_program.
>
> However when I try to invoke compile from a Program initialised with a
> source code string I receive the following error,
>
> def compile(self, options=[], devices=None, headers=[]):
> options = " ".join(options)
>> return self._prg.compile(options, devices, headers)
> E AttributeError: 'NoneType' object has no attribute 'compile'
>
> Looking at the pyopencl source it seems this is because self._prg is not
> initialized if a source string is given. Is this a bug, or am I trying to
> use compile in the wrong manner? Any advice welcome!
I'm not sure compile/link_program actually works in PyOpenCL. It doesn't
have any tests for it, which is probably a sign that it might not. If
you have a little bit of time, I'd appreciate if you could look into it
and submit a patch, ideally with tests. In the case above, I think
changing _prg to _get_prg() might already be enough.
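For reference, the separate-compile-then-link pattern being attempted would presumably look roughly like this once fixed (hypothetical toy sources; untested given the bug above, and the exact return value of compile() is unverified):

```python
import pyopencl as cl

ctx = cl.create_some_context()

src_helper = "void helper(void) { }"               # hypothetical module
src_main = "__kernel void k(__global int *a) { }"  # hypothetical module

# Compile each module to an (unlinked) program object ...
mod_a = cl.Program(ctx, src_helper)
mod_a.compile()
mod_b = cl.Program(ctx, src_main)
mod_b.compile()

# ... then link the compiled objects into one executable program.
prg = cl.link_program(ctx, [mod_a, mod_b])
```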
Andreas
Blair Azzopardi <blairuk(a)gmail.com> writes:
> Are there any more up to date instructions on building pyopencl on windows?
> I have been using a precompiled binary until now but would like to
> recompile myself.
>
> I note that when I follow instructions from README_SETUP.txt and then
> compile I receive errors along the lines of:
>
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(53):
> error C2061: syntax error: identifier 'uint32_t'
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(59):
> error C2065: 'add': undeclared identifier
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(59):
> error C2059: syntax error: 'const'
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(96):
> error C2061: syntax error: identifier 'ssize_t'
> etc..
>
> From the wiki it suggests using boost which I've tried but whenever I
> enable BOOST* options in my siteconf.py I get errors along the lines of:
>
> KeyError: 'invalid config key in .\\siteconf.py: BOOST_INC_DIR'
>
> Which seems to suggest those options have since been removed and the wiki
> page needs updating.
The git version of pyopencl no longer relies on boost. (But the most
recent released version still does.) There were some threads on the
mailing list regarding building the new version on Windows:
http://lists.tiker.net/pipermail/pyopencl/2015-August/thread.html
http://lists.tiker.net/pipermail/pyopencl/2015-August/001953.html
It'd be great if you could update the wiki. (I don't use Windows for
development, and so I'm not really in a position to maintain those
instructions.)
> Am 25.09.2015 um 15:06 schrieb Blair Azzopardi <blairuk(a)gmail.com>:
>
> Hi
>
> Are there any more up to date instructions on building pyopencl on windows? I have been using a precompiled binary until now but would like to recompile myself.
>
> I note that when I follow instructions from README_SETUP.txt and then compile I receive errors along the lines of:
>
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(53): error C2061: syntax error: identifier 'uint32_t'
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(59): error C2065: 'add': undeclared identifier
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(59): error C2059: syntax error: 'const'
> d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(96): error C2061: syntax error: identifier 'ssize_t'
> etc..
>
It seems you are using Microsoft VS2008; this compiler does not support the
more recent C++ features used by pyopencl. For building pyopencl (recent
master) on Windows, probably the only way to go is using the static mingw
toolchain. Follow e.g. the instructions at https://anaconda.org/carlkl/mingwpy
to install it.
Then
> python configure.py
edit siteconf.py, set the path to the OpenCL headers and libs
> python setup.py build_ext --compiler=mingw32
> python setup.py install
Current master of pyopencl does not use Boost anymore, so forget about it.
Hope that helped
Gregor
> From the wiki it suggests using boost which I've tried but whenever I enable BOOST* options in my siteconf.py I get errors along the lines of:
>
> KeyError: 'invalid config key in .\\siteconf.py: BOOST_INC_DIR'
>
> Which seems to suggest those options have since been removed and the wiki page needs updating.
>
> Thanks in advance,
> Blair
>
> _______________________________________________
> PyOpenCL mailing list
> PyOpenCL(a)tiker.net
> http://lists.tiker.net/listinfo/pyopencl
Hi All,
I am looking for advice on how to use Program.compile and/or
pyopencl.link_program. I had originally thought I could logically divide
parts of my CL code in to separate Programs/source files that are compiled,
but not linked immediately. Then when I need to use modules together I can
pass the list of compiled Program objects to link_program.
However when I try to invoke compile from a Program initialised with a
source code string I receive the following error,
def compile(self, options=[], devices=None, headers=[]):
options = " ".join(options)
> return self._prg.compile(options, devices, headers)
E AttributeError: 'NoneType' object has no attribute 'compile'
Looking at the pyopencl source it seems this is because self._prg is not
initialized if a source string is given. Is this a bug, or am I trying to
use compile in the wrong manner? Any advice welcome!
Regards,
James
Hi
Are there any more up to date instructions on building pyopencl on windows?
I have been using a precompiled binary until now but would like to
recompile myself.
I note that when I follow instructions from README_SETUP.txt and then
compile I receive errors along the lines of:
d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(53):
error C2061: syntax error: identifier 'uint32_t'
d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(59):
error C2065: 'add': undeclared identifier
d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(59):
error C2059: syntax error: 'const'
d:\software\windows\sources\pyopencl\src\c_wrapper\wrap_cl_core.h(96):
error C2061: syntax error: identifier 'ssize_t'
etc..
The wiki suggests using Boost, which I've tried, but whenever I enable
BOOST* options in my siteconf.py I get errors along the lines of:
KeyError: 'invalid config key in .\\siteconf.py: BOOST_INC_DIR'
Which seems to suggest those options have since been removed and the wiki
page needs updating.
Thanks in advance,
Blair
Yes, you're right - it is buggy. The carry really should be enqueued and then
added to the previous sum (as mentioned in my last email). I'll have to figure
out if that's possible. It may be useful for something else.
However, I've done some simple timings and note that simply launching as many
work items as there are words, without doing anything, is already slower than
running the gmp add until the number of words reaches around 250k+. So this
approach isn't worth pursuing, but thanks for your help.
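For anyone following along, a plain-Python reference for what the kernel is meant to compute may help: a multiple-precision add over little-endian uint32 word arrays with full carry propagation (the part the one-pass kernel above gets wrong). This is only a host-side sketch for checking results, not a device implementation:

```python
import numpy as np

def mp_add(x, y):
    """Add two little-endian uint32 word arrays, propagating carries fully."""
    assert len(x) == len(y)
    z = np.zeros(len(x), dtype=np.uint32)
    carry = 0
    for i in range(len(x)):
        s = int(x[i]) + int(y[i]) + carry  # exact arithmetic in Python ints
        z[i] = s & 0xFFFFFFFF              # low 32 bits of the word sum
        carry = s >> 32                    # carry into the next word
    return z

# The counterexample from the thread: a carry can itself overflow a word.
a = np.array([2**32 - 1, 2**32 - 1, 0], dtype=np.uint32)
b = np.array([1, 0, 0], dtype=np.uint32)
print(mp_add(a, b))  # -> [0 0 1]
```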
On Wed, Sep 23, 2015 at 1:23 PM, CRV§ADER//KY <crusaderky(a)gmail.com> wrote:
> Now your code makes more sense... however
>
> 1. it doesn't scale. You're limited to input data that is at most as large
> (in bytes) as min(local memory, 4*max group size). These two parameters
> will vary wildly between devices, so it's very poorly portable.
> 2. it's buggy as it doesn't consider that a carry can cause an overflow
> itself.
> Try summing up [2^32 - 1, 2^32- 1, 0] + [1, 0, 0]
> The expected result is [0, 0, 1] but you'll get [0, 0, 0].
>
> As for performance, I would look into using uint4s instead of uints. Since
> you'll have 1/4th of the barriers that way, I wouldn't be surprised to get
> a considerable performance boost.
> On 23 Sep 2015 09:46, "Blair Azzopardi" <blairuk(a)gmail.com> wrote:
>
>> Actually, this isn't the best way as it's being forced to run
>> sequentially. My original aim was to do the initial add then place the
>> carry array back into a queue to be added on a second pass. I'll see if
>> that's possible and revert. Thanks again.
>>
>> On Wed, Sep 23, 2015 at 9:14 AM, Blair Azzopardi <blairuk(a)gmail.com>
>> wrote:
>>
>>> That's very helpful. Thanks.
>>>
>>> You're right that I simplified my code for the sake of brevity. I am
>>> trying to write a simple multiple-precision adder, and I'm using local
>>> memory to hold the carry. Since the carry is actually the same size as the
>>> inputs (and hence the global size), I've now set the local size to the
>>> global size (I hadn't really considered this until I read your response).
>>> This appears to fix the problem, with and without the barrier.
>>>
>>> My only challenge now is improving performance. The complete kernel now
>>> looks like:
>>>
>>> #define MAXINT 4294967295U
>>> __kernel void sum_vector(__global const uint *x_g, __global const uint
>>> *y_g, __global uint *z_g, __local uint *c_l) {
>>>
>>> int i = get_local_id(0);
>>> uint z;
>>> uint c;
>>>
>>> if (x_g[i] > MAXINT - y_g[i]) { // overflow
>>> z = x_g[i] - (MAXINT - y_g[i]) - 1;
>>> c = 1;
>>> }
>>> else {
>>> z = x_g[i] + y_g[i];
>>> c = 0;
>>> }
>>>
>>> c_l[i] = c;
>>> barrier(CLK_LOCAL_MEM_FENCE); // wait for all threads to write local mem
>>>
>>> if (i > 0) {
>>> if (c_l[i-1] > 0) {
>>> z += c_l[i-1];
>>> }
>>> }
>>>
>>> z_g[i] = z;
>>> }
>>>
>>> Called using the following:
>>>
>>> a = np.array([...], dtype=np.uint32)
>>> b = np.array([...], dtype=np.uint32)
>>> c = np.zeros(max(len(a),len(b))+1, dtype=np.uint32)
>>> a.resize(c.shape)
>>> b.resize(c.shape)
>>>
>>> mf = cl.mem_flags
>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>> b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
>>> c_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(c))
>>>
>>> sum_vector(queue, c.shape, c.shape, a_g, b_g, c_g,
>>> cl.LocalMemory(4*len(c)))
>>> cl.enqueue_copy(queue, c, c_g)
>>>
>>> Any suggestions here? Should I try and split the kernel code to allow it
>>> to divide the task across work groups? Perhaps my goal is a little futile?
>>>
>>>
>>> On Wed, Sep 23, 2015 at 7:37 AM, CRV§ADER//KY <crusaderky(a)gmail.com>
>>> wrote:
>>>
>>>> There are multiple problems with your code.
>>>> First, I assume that your actual kernel is more complicated than that
>>>> and that you stripped code away - otherwise, using local memory in this
>>>> case is completely useless.
>>>>
>>>> Second, if you want to use local memory, you gotta put
>>>> barrier(CLK_LOCAL_MEM_FENCE) between the instruction that writes it and the
>>>> one that reads it, as they are in different threads so there's no guarantee
>>>> which will arrive first.
>>>>
>>>> Third, if you have more than 1 workgroup, in other words if
>>>> get_global_id() != get_local_id(), then you're going to access memory that
>>>> simply does not exist.
>>>>
>>>> Without local memory:
>>>> __kernel void memskip(__global uint *a_g, __global uint *b_g) {
>>>> int i = get_global_id(0);
>>>> if (i > 0)
>>>> b_g[i] = a_g[i-1];
>>>> else
>>>> b_g[i] = 0;
>>>> }
>>>>
>>>> With local memory; only useful if your code is actually more complex
>>>> than this - that is, it needs to access the same cell of a_g multiple
>>>> times, or write to the same cell of b_g multiple times
>>>>
>>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>> __local uint *c_l) {
>>>> int gid = get_global_id(0);
>>>> int lid = get_local_id(0);
>>>>
>>>> // c_l must be the same size as the workgroup
>>>> c_l[lid] = a_g[gid];
>>>>
>>>> // wait for all threads of a workgroup to reach this point
>>>> barrier(CLK_LOCAL_MEM_FENCE);
>>>>
>>>> if (lid > 0)
>>>> b_g[gid] = c_l[lid-1];
>>>> else if (gid > 0)
>>>> // can't use local memory!!! Your work group size had better be
>>>> // much larger than the warp/wavefront (32 on NVidia or 64 on AMD),
>>>> // or the whole benefit of local memory will vanish!
>>>> b_g[gid] = a_g[gid - 1];
>>>> else
>>>> b_g[gid] = 0;
>>>> }
>>>>
>>>> HTH
>>>> On 23 Sep 2015 00:08, "Blair Azzopardi" <blairuk(a)gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am trying to track down a particular issue where a local memory
>>>>> parameter appears to vanish the 1st time I run a kernel, but works
>>>>> correctly on every subsequent invocation. This is on my Nvidia GeForce
>>>>> GT 555M.
>>>>>
>>>>> While trying to trace the above issue and switching to my Intel CPU I
>>>>> came across even more unusual behaviour.
>>>>>
>>>>> With the following Kernel
>>>>>
>>>>> __kernel void localmemskip(__global uint *a_g, __global uint *b_g,
>>>>> __local uint *c_l) {
>>>>> int i = get_global_id(0);
>>>>> c_l[i] = a_g[i];
>>>>> b_g[i] = 0;
>>>>> if (i > 0)
>>>>> b_g[i] = c_l[i-1];
>>>>> }
>>>>>
>>>>> I invoke the following code:
>>>>>
>>>>> a = np.array([1,2,3,4,5,6,7,8,9,10], dtype=np.uint32)
>>>>> b = np.zeros(10, dtype=np.uint32)
>>>>>
>>>>> mf = cl.mem_flags
>>>>> a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
>>>>> b_g = cl.Buffer(ctx, mf.WRITE_ONLY, 4*len(b))
>>>>>
>>>>> localmemskip(queue, b.shape, None, a_g, b_g, cl.LocalMemory(4*len(b)))
>>>>> cl.enqueue_copy(queue, b, b_g)
>>>>> print(b)
>>>>>
>>>>> Then the output for each platform is:
>>>>>
>>>>> [0 1 2 3 4 5 6 7 8 9] # <pyopencl.Platform 'NVIDIA CUDA' >
>>>>>
>>>>> [0 1 2 0 4 0 6 242802248 0 361062976] # <pyopencl.Platform 'Intel(R) OpenCL' >
>>>>>
>>>>>
>>>>> What's more, each output can start with a different array, settling
>>>>> down to the above after the 1st invocation.
>>>>>
>>>>> I have my suspicions it's something to do with how the local memory is
>>>>> initialised.
>>>>>
>>>>> Can anyone see if I am doing something wrong?
>>>>>
>>>>> Thanks
>>>>> Blair
>>>>>