On Wed, 23 Feb 2011 12:12:13 +0100, Magnus Paulsson <paulsson.m(a)gmail.com> wrote:
First, thanks for developing PyCUDA. I just started playing with it this
week and already have code that outperforms the numpy version 10-100
fold. However, some things are still unclear to me so I will mix
explaining how I understand things and ask questions. Please correct
me if my understanding is faulty.
1: gpuarray: I only use gpuarray to send data to the device, even if
I use my own kernels or scikit.cuda on the data. However, as the
following example demonstrates, you have to make copies of the numpy
array before sending it to the gpu to ensure consistent indexing
(C-contiguous storage without any strange strides) for multi-dimensional
arrays:
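import pycuda.autoinit  # creates a CUDA context on the default device
import pycuda.gpuarray as gpuarray
import numpy as N
a = N.zeros((2, 2))  # assumed: the original definition of a is missing
a[0,1] = 1
a[1,0] = 2
aT = a.T  # assumed from the discussion below: a transposed view, not a copy
print("\ngpu a=\n%s" % gpuarray.to_gpu(a).get())
print("\ngpu a^T=\n%s" % gpuarray.to_gpu(aT).get())
print("\ngpu a^T.copy()=\n%s" % gpuarray.to_gpu(aT.copy()).get())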
Note that the gpuarray.to_gpu(aT) is not transposed as it should be.
However, making the copy cures this.
Right-- .T in numpy doesn't change memory layout, it just gives you a
new numpy array pointing to the same storage with different
meta-information about strides and array dimensions. Since PyCUDA copies
the bare bits, the memory layout on the GPU is unchanged as well.
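
The point about .T can be checked on the host alone, without a GPU. The
following is a minimal numpy-only sketch; the 2x2 shape and the variable
names are assumptions chosen for illustration, and the "bare bits" copy is
imitated by reading the elements in memory order.

```python
import numpy as np

a = np.zeros((2, 2))
a[0, 1] = 1
a[1, 0] = 2
aT = a.T  # a view: same buffer, swapped strides

shares_buffer = np.shares_memory(a, aT)
strides_a, strides_aT = a.strides, aT.strides

# Imitate a copy of the bare bits: read elements in memory order ("K"),
# then interpret them as a C-ordered array of the same shape.
bare_bits = aT.ravel(order="K").reshape(aT.shape)

# Making the data C-contiguous first rearranges the memory itself.
aT_contig = np.ascontiguousarray(aT)
bare_bits_fixed = aT_contig.ravel(order="K").reshape(aT_contig.shape)
```

Here bare_bits comes out equal to a rather than aT, while bare_bits_fixed
equals aT. This is why aT.copy() (or np.ascontiguousarray) cures the problem.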
2: Async to device: I read that you need page-locked memory on the
host for the async copies to work.
Does pycuda.gpuarray.to_gpu_async(x) lock the memory of the numpy
array x or copy the data to a locked memory area?
Neither. You need to have made the array using
3: gpuarray.get_async(): Is control returned to python before the
transfer is completed (as async would indicate)? How do I check when
the transfer is complete?
Streams and events.
Automatically allocated for you in get_async().
Do I have to create streams and events to make
async copies work?
Not necessarily, but without them it's kind of useless.
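
As an illustration of how these pieces can fit together, here is a hedged
sketch of an asynchronous round trip. It assumes the host buffers are
allocated with pycuda.driver.pagelocked_empty (one way to obtain page-locked
memory, not necessarily the one meant above); the function name
async_roundtrip is hypothetical. It needs a CUDA-capable device to run.

```python
def async_roundtrip(values):
    """Copy values to the GPU and back with asynchronous transfers."""
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context on the default device)
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    values = np.asarray(values, dtype=np.float32)

    # Assumption: page-locked host buffers from pagelocked_empty.
    # Ordinary numpy arrays live in pageable memory.
    host_in = drv.pagelocked_empty(values.shape, values.dtype)
    host_in[...] = values
    host_out = drv.pagelocked_empty(values.shape, values.dtype)

    stream = drv.Stream()
    done = drv.Event()

    # Both copies are queued on the same stream, so they run in order.
    dev = gpuarray.to_gpu_async(host_in, stream=stream)
    dev.get_async(stream=stream, ary=host_out)
    done.record(stream)

    # Control is back in Python at this point. done.query() reports
    # whether the queued work has finished without blocking.
    done.synchronize()  # block until the event has been reached
    return host_out
```

Calling async_roundtrip([1, 2, 3]) should return a page-locked array holding
the same values. Other host work can be done between record() and
synchronize().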
4: Streams: My understanding is that each stream is executed serially,
while different streams run in parallel, except stream "0",
which waits for all other streams to finish before starting. Any
We don't have a simple one in PyCUDA--if you'd like to write one, that'd
be much appreciated.
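
In that spirit, here is a rough sketch of what a small two-stream example
might look like. It is not an official PyCUDA example: the function name
scale_in_two_streams, the kernel, and the split into two chunks are all
assumptions made for illustration. It requires a CUDA device and nvcc.

```python
def scale_in_two_streams(values, factor=2.0):
    """Scale values on the GPU, processing two halves in separate streams."""
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context on the default device)
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *x, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= factor;
    }
    """)
    scale = mod.get_function("scale")

    values = np.asarray(values, dtype=np.float32).ravel()
    work = []
    for chunk in np.array_split(values, 2):
        # Page-locked host buffer so the async copies can overlap.
        host = drv.pagelocked_empty(chunk.shape, np.float32)
        host[...] = chunk
        dev = drv.mem_alloc(max(host.nbytes, 1))
        stream = drv.Stream()
        n = host.size
        if n:
            # Upload, kernel, and download are queued in order on this stream.
            drv.memcpy_htod_async(dev, host, stream)
            scale(dev, np.float32(factor), np.int32(n),
                  block=(256, 1, 1), grid=((n + 255) // 256, 1),
                  stream=stream)
            drv.memcpy_dtoh_async(host, dev, stream)
        work.append((stream, host, dev))  # keep buffers alive until done

    for stream, _, _ in work:
        stream.synchronize()
    return np.concatenate([host for _, host, _ in work])
```

Work queued on one stream runs in the order it was submitted, while the two
streams are free to overlap. Whether they actually do depends on the
hardware.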