Keegan Owsley <keegano(a)gmail.com> writes:
> Something that I don't think I made clear before: the kernels generated by
> get_elwise_module_noncontig are modified using regular expressions, so that
> you don't need to change your code downstream to get strided array support.
> I'm not convinced yet that this is the best approach, but it works okay so
> far.
To make it easier to see your changes and comment on them (such as
exactly how robust the Regex stuff is), could you put this up as a pull
request here?
https://gitlab.tiker.net/inducer/pycuda
This will automatically run tests, so it's easier for me to handle.
Thanks!
Andreas
Alright, I've updated the code in my repository. It now passes all of the
unit tests in test_gpuarray (at least, the ones that were passing before).
The old behavior is the default, so that should preserve compatibility
with existing code (still largely untested). To create a noncontiguous
kernel, pass use_strides=True to get_elwise_kernel, and pass
gpuarray.device_shape_and_strides to func.prepared_async_call. I've updated
all of the methods in gpuarray.py to check for noncontiguous inputs/outputs
and call the appropriate function.
Something that I don't think I made clear before: the kernels generated by
get_elwise_module_noncontig are modified using regular expressions, so that
you don't need to change your code downstream to get strided array support.
I'm not convinced yet that this is the best approach, but it works okay so
far.
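To give a flavor of the idea (this is only a sketch of the kind of
substitution involved, with made-up helper and argument names, not the
actual patch code), the generated source is rewritten so that plain x[i]
style accesses go through the index-mapping helper:

import re

# Hypothetical sketch: rewrite "name[i]" accesses in a generated kernel body
# so that they index through an i2m()-style flat-index-to-offset helper.
def add_strided_indexing(body, arg_names):
    for n, name in enumerate(arg_names):
        # turn e.g. "x[i]" into "x[i2m(i, *__dim0)]"
        body = re.sub(r"\b%s\[i\]" % re.escape(name),
                      "%s[i2m(i, *__dim%d)]" % (name, n),
                      body)
    return body

print(add_strided_indexing("z[i] = x[i] / y[i]", ["x", "y", "z"]))
# -> z[i2m(i, *__dim2)] = x[i2m(i, *__dim0)] / y[i2m(i, *__dim1)]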
So, for example, code like this:
@context_dependent_memoize
def get_binary_op_kernel(dtype_x, dtype_y, dtype_z, operator,
        use_strides=False):
    return get_elwise_kernel(
            "%(tp_x)s *x, %(tp_y)s *y, %(tp_z)s *z" % {
                "tp_x": dtype_to_ctype(dtype_x),
                "tp_y": dtype_to_ctype(dtype_y),
                "tp_z": dtype_to_ctype(dtype_z),
                },
            "z[i] = x[i] %s y[i]" % operator,
            "multiply", use_strides=use_strides)
Generates kernels like this:
#include <pycuda-complex.hpp>

typedef struct
{
    unsigned n[2];
    long stride[2];
} dim;

__device__ unsigned i2m(unsigned i, dim d)
{
    unsigned m = 0;
    unsigned j = i;
    for(int k = 0; k < 2; k++)
    {
        m += d.stride[k] * (j % d.n[k]);
        j = j / d.n[k];
    }
    return m;
}

__global__ void multiply(double *x, double *y, double *z,
    unsigned long long n, dim *__dim0, dim *__dim1, dim *__dim2)
{
    unsigned tid = threadIdx.x;
    unsigned total_threads = gridDim.x*blockDim.x;
    unsigned cta_start = blockDim.x*blockIdx.x;
    unsigned i;
    ;
    for (i = cta_start + tid; i < n; i += total_threads)
    {
        z[i2m(i,*__dim2)] = x[i2m(i,*__dim0)] / y[i2m(i,*__dim1)];
    }
    ;
}
and you use them like this:
def is_noncontig(*vecs):
    r = False
    for v in vecs:
        r = r or not v.flags.forc
    return r


def _div(self, other, out, stream=None):
    """Divides an array by another array."""
    assert self.shape == other.shape

    noncontig = is_noncontig(self, other, out)
    func = elementwise.get_binary_op_kernel(self.dtype, other.dtype,
            out.dtype, "/", use_strides=noncontig)

    if noncontig:
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size,
                self.device_shape_and_strides,
                other.device_shape_and_strides,
                out.device_shape_and_strides)
    else:
        func.prepared_async_call(self._grid, self._block, stream,
                self.gpudata, other.gpudata,
                out.gpudata, self.mem_size)
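For clarity, here's roughly what i2m computes, written out in Python (this
is just an illustration of the index mapping, not code from the patch;
offsets and strides are in elements, and axis 0 is taken as the
fastest-varying in the flat index):

def i2m(i, n, stride):
    # map flat element index i to a memory offset, given per-axis
    # sizes n[k] and element strides stride[k]
    m = 0
    for nk, sk in zip(n, stride):
        m += sk * (i % nk)
        i //= nk
    return m

# e.g. axis sizes (4, 3) with element strides (2, 8):
# i2m(5, (4, 3), (2, 8)) == 2*1 + 8*1 == 10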
In principle, similar updates can be made to modules that depend on
pycuda.elementwise down the road, to also support strided arrays without
big rewrites.
I'm open to suggestions for improving performance, but I'd like to focus on
compatibility for now. Frédéric's suggestion is a good one, but I suspect
there's another source of slowdowns somewhere, because the slowdowns I'm
seeing don't scale with array size.
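For reference, my reading of Frédéric's suggestion (quoted below) is to
collapse adjacent axes that are laid out contiguously with respect to each
other before launching the kernel, so the index-mapping loop runs over fewer
axes. A minimal sketch of that idea, with shapes and strides in elements and
C order assumed (this is not the Theano code, just an illustration):

def fuse_dims(shape, strides):
    # collapse axis k into axis k+1 when axis k+1 is packed inside axis k,
    # i.e. strides[k] == strides[k+1] * shape[k+1]
    shape, strides = list(shape), list(strides)
    k = 0
    while k + 1 < len(shape):
        if strides[k] == strides[k + 1] * shape[k + 1]:
            shape[k + 1] *= shape[k]
            del shape[k], strides[k]
        else:
            k += 1
    return tuple(shape), tuple(strides)

# a fully contiguous (1000, 100) array fuses down to a single axis:
# fuse_dims((1000, 100), (100, 1)) -> ((100000,), (1,))
# a sliced view like a[:, :-1] cannot be fused:
# fuse_dims((1000, 99), (100, 1)) -> ((1000, 99), (100, 1))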
Keegan
On Thu, Dec 1, 2016 at 11:54 AM, Keegan Owsley <keegano(a)gmail.com> wrote:
> It seems striding in kernels can indeed incur a large penalty, though the
> result is dependent on the array size. I've just updated the code to only
> use the noncontiguous kernel if necessary.
>
> Using a GTX 970, I get the following results using the pow kernel (I'm
> using ipython's %timeit here, because for some reason python appears to
> hang on my machine if I use the standard timeit module):
>
> >>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
> >>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
> >>> b2 = b1[::2, :-1].copy()
> >>> # force CUDA to compile the kernels before we time
> >>> b1[::2,:-1]**2
> >>> b2**2
>
> >>> %timeit b1[::2,:-1]**2
> 1000 loops, best of 3: 654 µs per loop
> >>> %timeit b2**2
> 10000 loops, best of 3: 147 µs per loop
>
> >>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
> >>> b2 = b1[::2, :-1].copy()
>
> >>> %timeit b1[::2,:-1]**2
> 100 loops, best of 3: 2.05 ms per loop
> >>> %timeit b2**2
> 1000 loops, best of 3: 1.66 ms per loop
>
> I'll try to clean up the code and get it into my repo today.
>
> Keegan
>
>
> On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien <
> frederic.bastien(a)gmail.com> wrote:
>
>> I can share our experience in Theano related to that.
>>
>> I added code to "fuse" dimensions that are contiguous. So if you have a
>> 3d tensor and only one dimension is non-contiguous, you don't pay the
>> indexing cost of a 3d tensor, only that of a 2d tensor.
>>
>> I saw a 2x speed-up in such cases. That was with an old GPU and an old
>> CUDA version, so maybe this has changed.
>>
>> In the new Theano back-end, we have rewritten that in a more readable
>> version:
>>
>> https://github.com/Theano/libgpuarray/blob/master/src/gpuarray_util.c#L153
>>
>> This also takes broadcasting into account.
>>
>> This could be done before calling the kernel.
>>
>> On Wed, Nov 30, 2016 at 8:24 PM, Andreas Kloeckner <
>> lists(a)informa.tiker.net> wrote:
>>
>>> Keegan,
>>>
>>> Keegan Owsley <keegano(a)gmail.com> writes:
>>> > I've just slapped together a patch to pycuda that makes most
>>> > elementwise operations work with noncontiguous arrays. There are a
>>> > bunch of hacks in there, and the code needs some reorg before it's
>>> > ready to be considered for upstream (I made these changes while
>>> > learning the pycuda codebase, so there's a bunch of crud that can be
>>> > cleaned out), but I figure I might as well put it out there in its
>>> > current state and see what you guys think. It's also not extremely
>>> > well-tested (I have no idea if it interferes with skcuda, for
>>> > example), but all of the main functions appear to work.
>>> >
>>> > You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.
>>> >
>>> > Briefly, this works by adding new parameters into elementwise kernels
>>> > that describe the stride and shape of your arrays, then using a
>>> > function that computes the location in memory from the stride, shape,
>>> > and index. Elementwise kernel ops are modified so that they use the
>>> > proper indexing. See an example of a kernel that's generated below:
>>>
>>> Thanks for putting this together and sharing it! I have one main
>>> question about this, regarding performance:
>>>
>>> Modulo (especially variable-denominator modulo) has a habit of being
>>> fantastically slow on GPUs. Could you time contiguous
>>> vs. noncontiguous for various levels of "gappiness" and number of
>>> axes? I'm asking this because I'd be OK with a 50% slowdown, but not
>>> necessarily a factor of 5 slowdown on actual GPU hardware.
>>>
>>> Thanks!
>>> Andreas
>>>
Jaroslaw,
Jaroslaw Blusewicz <jaroslaw.blusewicz(a)gmail.com> writes:
> I'm using numpy-sharedmem <https://bitbucket.org/cleemesser/numpy-sharedmem>
> to allocate a shared memory array across multiple CPU processes. However,
> after page-locking it with register_host_memory, the shared memory is never
> cleared at exit. Below is a minimal example of this behavior on Ubuntu
> 16.04, Python 2.7.12, and PyCUDA 2016.1.2:
>
> import sharedmem
> import numpy as np
> from pycuda import autoinit
> import pycuda.driver as driver
>
> arr = sharedmem.zeros(10 ** 8, dtype=np.float32)
> arr = driver.register_host_memory(arr,
> flags=driver.mem_host_register_flags.DEVICEMAP)
>
> At exit, this shared memory array is not cleared. Unregistering the
> pagelocked memory beforehand doesn't work either.
>
> Also, I noticed that the RegisteredHostMemory instance in arr.base, which
> according to the documentation
> <https://documen.tician.de/pycuda/driver.html#pycuda.driver.RegisteredHostMe…>
> should have a base attribute containing the original array, doesn't
> actually have one.
>
> Is there a manual way of clearing this shared memory in pycuda that I'm
> missing?
I'm honestly not sure that pagelocked and SysV shared memory have a
defined interaction, i.e. I don't even know what's supposed to
happen. And at any rate, for what you're doing, you're just getting the
behavior of the CUDA API--I'm not sure PyCUDA could help or hurt in your
case.
tl;dr: Ask someone at Nvidia if this is supposed to work; if it is, and
PyCUDA breaks it, I'm happy to try and help fix it.
Andreas
It seems striding in kernels can indeed incur a large penalty, though the
result is dependent on the array size. I've just updated the code to only
use the noncontiguous kernel if necessary.
Using a GTX 970, I get the following results using the pow kernel (I'm
using ipython's %timeit here, because for some reason python appears to
hang on my machine if I use the standard timeit module):
>>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
>>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
>>> b2 = b1[::2, :-1].copy()
>>> # force CUDA to compile the kernels before we time
>>> b1[::2,:-1]**2
>>> b2**2
>>> %timeit b1[::2,:-1]**2
1000 loops, best of 3: 654 µs per loop
>>> %timeit b2**2
10000 loops, best of 3: 147 µs per loop
>>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
>>> b2 = b1[::2, :-1].copy()
>>> %timeit b1[::2,:-1]**2
100 loops, best of 3: 2.05 ms per loop
>>> %timeit b2**2
1000 loops, best of 3: 1.66 ms per loop
I'll try to clean up the code and get it into my repo today.
Keegan
On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien <
frederic.bastien(a)gmail.com> wrote:
> I can share our experience in Theano related to that.
>
> I added code to "fuse" dimensions that are contiguous. So if you have a 3d
> tensor and only one dimension is non-contiguous, you don't pay the indexing
> cost of a 3d tensor, only that of a 2d tensor.
>
> I saw a 2x speed-up in such cases. That was with an old GPU and an old CUDA
> version, so maybe this has changed.
>
> In the new Theano back-end, we have rewritten that in a more readable
> version:
>
> https://github.com/Theano/libgpuarray/blob/master/src/gpuarray_util.c#L153
>
> This also takes broadcasting into account.
>
> This could be done before calling the kernel.
>
> On Wed, Nov 30, 2016 at 8:24 PM, Andreas Kloeckner <
> lists(a)informa.tiker.net> wrote:
>
>> Keegan,
>>
>> Keegan Owsley <keegano(a)gmail.com> writes:
>> > I've just slapped together a patch to pycuda that makes most elementwise
>> > operations work with noncontiguous arrays. There are a bunch of hacks in
>> > there, and the code needs some reorg before it's ready to be considered
>> > for upstream (I made these changes while learning the pycuda codebase,
>> > so there's a bunch of crud that can be cleaned out), but I figure I
>> > might as well put it out there in its current state and see what you
>> > guys think. It's also not extremely well-tested (I have no idea if it
>> > interferes with skcuda, for example), but all of the main functions
>> > appear to work.
>> >
>> > You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.
>> >
>> > Briefly, this works by adding new parameters into elementwise kernels
>> > that describe the stride and shape of your arrays, then using a function
>> > that computes the location in memory from the stride, shape, and index.
>> > Elementwise kernel ops are modified so that they use the proper indexing.
>> > See an example of a kernel that's generated below:
>>
>> Thanks for putting this together and sharing it! I have one main
>> question about this, regarding performance:
>>
>> Modulo (especially variable-denominator modulo) has a habit of being
>> fantastically slow on GPUs. Could you time contiguous
>> vs. noncontiguous for various levels of "gappiness" and number of
>> axes? I'm asking this because I'd be OK with a 50% slowdown, but not
>> necessarily a factor of 5 slowdown on actual GPU hardware.
>>
>> Thanks!
>> Andreas
>>
Hello,
I'm using numpy-sharedmem <https://bitbucket.org/cleemesser/numpy-sharedmem>
to allocate a shared memory array across multiple CPU processes. However,
after page-locking it with register_host_memory, the shared memory is never
cleared at exit. Below is a minimal example of this behavior on Ubuntu
16.04, Python 2.7.12, and PyCUDA 2016.1.2:
import sharedmem
import numpy as np
from pycuda import autoinit
import pycuda.driver as driver
arr = sharedmem.zeros(10 ** 8, dtype=np.float32)
arr = driver.register_host_memory(arr,
        flags=driver.mem_host_register_flags.DEVICEMAP)
At exit, this shared memory array is not cleared. Unregistering the
pagelocked memory beforehand doesn't work either.
Also, I noticed that the RegisteredHostMemory instance in arr.base, which
according to the documentation
<https://documen.tician.de/pycuda/driver.html#pycuda.driver.RegisteredHostMe…>
should have a base attribute containing the original array, doesn't actually
have one.
Is there a manual way of clearing this shared memory in pycuda that I'm
missing?
Thanks in advance,
Jarek
Keegan,
Keegan Owsley <keegano(a)gmail.com> writes:
> I've just slapped together a patch to pycuda that makes most elementwise
> operations work with noncontiguous arrays. There are a bunch of hacks in
> there, and the code needs some reorg before it's ready to be considered for
> upstream (I made these changes while learning the pycuda codebase, so
> there's a bunch of crud that can be cleaned out), but I figure I might as
> well put it out there in its current state and see what you guys think.
> It's also not extremely well-tested (I have no idea if it interferes with
> skcuda, for example), but all of the main functions appear to work.
>
> You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.
>
> Briefly, this works by adding new parameters into elementwise kernels that
> describe the stride and shape of your arrays, then using a function that
> computes the location in memory from the stride, shape, and index.
> Elementwise kernel ops are modified so that they use the proper indexing.
> See an example of a kernel that's generated below:
Thanks for putting this together and sharing it! I have one main
question about this, regarding performance:
Modulo (especially variable-denominator modulo) has a habit of being
fantastically slow on GPUs. Could you time contiguous
vs. noncontiguous for various levels of "gappiness" and number of
axes? I'm asking this because I'd be OK with a 50% slowdown, but not
necessarily a factor of 5 slowdown on actual GPU hardware.
Thanks!
Andreas
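For concreteness, a rough sketch of the kind of comparison being asked for
(this assumes the strided-kernel branch under discussion, since the
noncontiguous slicing and operations are exactly what the patch enables; the
slicing patterns and timing approach are only an illustration, not a
prescribed benchmark):

import time
import numpy as np
import pycuda.autoinit   # creates a context
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

def time_op(x, reps=100):
    x ** 2                         # warm up / force kernel compilation
    drv.Context.synchronize()
    t0 = time.time()
    for _ in range(reps):
        x ** 2
    drv.Context.synchronize()      # kernel launches are asynchronous
    return (time.time() - t0) / reps

base = gpuarray.arange(10 ** 6, dtype=np.float64).reshape(1000, 1000)
for step in (1, 2, 4, 8):          # increasing "gappiness" along the slow axis
    strided = base[::step, :-1]    # noncontiguous view
    packed = strided.copy()        # contiguous copy of the same data
    print(step, time_op(strided), time_op(packed))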