Thank you for your suggestions. I must admit I'd rather stick with using
the high-level functions that come with pyopencl or reikna. Writing my own
OpenCL kernels is a little out of reach for me. I'll deal with this when
I have more complex subtasks to solve. That transposing thing of mine
works reasonably well and is still faster than padding on the host.
Newbie question: is it even possible that fancy indexing will one day
work on GPUs?