Alexander Bock <alexander.asp.bock(a)gmail.com> writes:
I am creating some timing tests with PyCUDA for batch-loading an image
sequence. I first tried timing a normal, synchronous transfer over global
memory.
Now I am looking to test pagelocked memory. Specifically, I would like to
test: single-stream, synchronous pagelocked transfers; multi-stream,
asynchronous pagelocked transfers; and zero-copy memory using device-mapped
memory.
For the first one, do I simply call pycuda.driver.memcpy_htod/dtoh using
the pagelocked memory? (I am using mem_flags=0 to create the pagelocked
memory; I assume it corresponds to cudaHostAllocDefault.) For the second, I
would use the memcpy_(htod/dtoh)_async calls with more than one stream (my
laptop supports concurrent kernels). For the final one, I would create my
own context using pycuda.driver.Device.make_context with the MAP_HOST flag,
allocate the pagelocked memory using host_alloc_flags.DEVICEMAP, and call
my kernel with the device pointer. Am I on the right track?
Yep, that sounds right.
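
For the first two cases, a rough sketch of what the timing code might look
like is below. It is only an illustration under a few assumptions: the
helper names (chunk_bounds, time_pagelocked_transfers), the buffer size and
the number of streams are made up for the example, and whether the
multi-stream version is actually faster depends on your hardware.

```python
def chunk_bounds(n_items, n_chunks):
    """Split range(n_items) into n_chunks contiguous (start, stop) pairs."""
    base, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def time_pagelocked_transfers(n_floats=1 << 22, n_streams=2):
    """Return (sync_ms, async_ms) for host-to-device copies.

    Needs PyCUDA and a CUDA-capable GPU, so the imports are kept local.
    """
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context on device 0)
    import pycuda.driver as cuda

    # mem_flags=0 requests a default pagelocked allocation.
    host = cuda.pagelocked_empty(n_floats, np.float32, mem_flags=0)
    host[:] = 1.0
    dev = cuda.mem_alloc(host.nbytes)
    start, end = cuda.Event(), cuda.Event()

    # Case 1: single synchronous copy from pagelocked memory.
    start.record()
    cuda.memcpy_htod(dev, host)
    end.record()
    end.synchronize()
    sync_ms = start.time_till(end)

    # Case 2: one asynchronous copy per stream, each covering one chunk.
    streams = [cuda.Stream() for _ in range(n_streams)]
    start.record()
    for stream, (lo, hi) in zip(streams, chunk_bounds(n_floats, n_streams)):
        dest = int(dev) + lo * host.itemsize  # byte offset into the buffer
        cuda.memcpy_htod_async(dest, host[lo:hi], stream)
    for stream in streams:
        stream.synchronize()
    end.record()
    end.synchronize()
    async_ms = start.time_till(end)

    return sync_ms, async_ms
```

On a machine with a GPU, time_pagelocked_transfers() returns both timings in
milliseconds. The device-to-host direction works the same way with
memcpy_dtoh and memcpy_dtoh_async.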
In terms of documentation, the CUDA programming guide applies. Just make
sure to look at the "driver" interface, not the "runtime" interface: the
lowest layer of PyCUDA is just a coat of Python paint on the driver API.
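
To illustrate how the driver-level names carry over, here is a hedged sketch
of the zero-copy case. The comments name the driver API calls that, as far
as I know, sit underneath each PyCUDA call. Note that make_context is a
method of pycuda.driver.Device and that PyCUDA spells the allocation flag
host_alloc_flags.DEVICEMAP. The kernel and the helper names (scale,
grid_size, run_zero_copy) are invented for the example.

```python
KERNEL_SOURCE = """
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
"""


def grid_size(n_items, block_size):
    """Number of blocks needed to cover n_items (ceiling division)."""
    return (n_items + block_size - 1) // block_size


def run_zero_copy(n_floats=1 << 20, block_size=256):
    """Scale an array living in mapped host memory and return a copy of it.

    Needs PyCUDA and a GPU that can map host memory, so the imports are
    kept local.
    """
    import numpy as np
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule

    cuda.init()
    # Driver API: cuCtxCreate with CU_CTX_MAP_HOST.
    ctx = cuda.Device(0).make_context(flags=cuda.ctx_flags.MAP_HOST)
    try:
        # Driver API: cuMemHostAlloc with CU_MEMHOSTALLOC_DEVICEMAP.
        host = cuda.pagelocked_empty(
            n_floats, np.float32, mem_flags=cuda.host_alloc_flags.DEVICEMAP
        )
        host[:] = 1.0

        # Driver API: cuMemHostGetDevicePointer.
        dev_ptr = np.intp(host.base.get_device_pointer())

        scale = SourceModule(KERNEL_SOURCE).get_function("scale")
        scale(
            dev_ptr, np.float32(2.0), np.int32(n_floats),
            block=(block_size, 1, 1),
            grid=(grid_size(n_floats, block_size), 1),
        )
        # Wait for the kernel before reading the mapped memory on the host.
        cuda.Context.synchronize()
        result = host.copy()
    finally:
        ctx.pop()
    return result
```

No explicit memcpy appears anywhere: the kernel reads and writes the host
buffer directly through the mapped device pointer.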
Example and docs contributions would be more than welcome!