Hi,
I'm trying to use multiple GPUs with MPI, using IPC handles instead of the
built-in MPI primitives for P2P communication.
I think I'm not quite understanding how contexts should be managed. For
example, I have two versions of a toy example that tries to access data
between processes via an IPC handle. Both seem to work, in the sense that
process 1 can 'see' the data from process 0, but the first version completes
without any error, while the second version generates the following error:
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: invalid value
The two versions are attached below. Would appreciate any insight as to
what I'm doing wrong.
-Alex
Here are the two versions:
*VERSION 1*
from mpi4py import MPI
import numpy as np
import atexit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray


class TestMGPU(object):
    def __init__(self):
        self.mpi_size = MPI.COMM_WORLD.size
        self.mpi_rank = MPI.COMM_WORLD.rank

    def proc(self):
        if self.mpi_rank == 0:
            ctx = drv.Device(self.mpi_rank).make_context()
            self.x_gpu = gpuarray.to_gpu(np.random.rand(8))
            h = drv.mem_get_ipc_handle(self.x_gpu.ptr)
            MPI.COMM_WORLD.send((h, self.x_gpu.shape, self.x_gpu.dtype), dest=1)
            print 'p1 self.x_gpu:', self.x_gpu
            ctx.detach()
        else:
            ctx = drv.Device(self.mpi_rank).make_context()
            h, s, d = MPI.COMM_WORLD.recv(source=0)
            ptr = drv.IPCMemoryHandle(h)
            xt_gpu = gpuarray.GPUArray(s, d, gpudata=ptr)
            print 'xt_gpu: ', xt_gpu
            ctx.detach()


if __name__ == '__main__':
    drv.init()
    atexit.register(MPI.Finalize)
    a = TestMGPU()
    a.proc()
*VERSION 2 (Imports are the same)*
class TestMGPU(object):
    def __init__(self):
        self.mpi_size = MPI.COMM_WORLD.size
        self.mpi_rank = MPI.COMM_WORLD.rank
        self.x_gpu = gpuarray.to_gpu(np.random.rand(8))

    def proc(self):
        if self.mpi_rank == 0:
            h = drv.mem_get_ipc_handle(self.x_gpu.ptr)
            MPI.COMM_WORLD.send((h, self.x_gpu.shape, self.x_gpu.dtype), dest=1)
            print 'p1 self.x_gpu:', self.x_gpu
        else:
            h, s, d = MPI.COMM_WORLD.recv(source=0)
            ptr = drv.IPCMemoryHandle(h)
            xt_gpu = gpuarray.GPUArray(s, d, gpudata=ptr)
            print 'xt_gpu: ', xt_gpu


if __name__ == '__main__':
    drv.init()
    ctx = drv.Device(MPI.COMM_WORLD.rank).make_context()
    atexit.register(ctx.pop)
    atexit.register(MPI.Finalize)
    a = TestMGPU()
    a.proc()
Ananth Sridharan <ananth(a)umd.edu> writes:
> Can someone shed some light on the arguments for the "prepare" used by
> pycuda? I have been unable to find a set of examples to help understand
> what the arguments are supposed to look like.
>
> On the official website,
> http://documen.tician.de/pycuda/tutorial.html?highlight=prepare
> what does the argument "P" stand for? In another example, I found someone
> using "PiPi" - what is the prepare statement supposed to be for, say, a
> kernel that looks like this..
>
> __global__ dosomething(float *a, double *b, float *c, int nmax)
>
> is it supposed to be func.prepare("fdfi"), or did I not understand the
> concept correctly?
This page defines the format:
https://docs.python.org/2/library/struct.html
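The letters come from Python's struct module: "P" stands for a generic
pointer (void *), "i" for an int, and "f"/"d" for float/double scalars.
Array arguments are passed to the kernel as device pointers, so for the
kernel above the prepared format would be "PPPi" (three pointers plus an
int), not "fdfi". A rough, untested sketch (sizes and argument values are
made up for illustration):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void dosomething(float *a, double *b, float *c, int nmax)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < nmax)
        c[i] = a[i] + (float) b[i];
}
""")
func = mod.get_function("dosomething")

n = 256
a_gpu = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
b_gpu = gpuarray.to_gpu(np.random.rand(n))   # float64 array
c_gpu = gpuarray.empty(n, np.float32)

# three device pointers and one int
func.prepare("PPPi")
func.prepared_call((1, 1), (n, 1, 1),
                   a_gpu.gpudata, b_gpu.gpudata, c_gpu.gpudata,
                   np.int32(n))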
Andreas
Dear Christian,
First of all, please make sure to keep the list cc'd in your
replies. Without the benefits of searchable archival, justifying the
time to answer questions is much harder.
Christian Hacker <christian.b.hacker(a)gmail.com> writes:
> Thank you for your reply. If I'm understanding you correctly, it is
> acceptable to have numpy arrays of objects allocated on the host and then
> to assign GPUArray instances as elements of those arrays. I didn't take
> into account the overhead from launching the kernel - that may explain why
> things work so slowly. I will attempt to test the simulator with larger
> network topologies once I have pycuda set up on a machine with a
> sufficiently powerful GPU.
>
> If you will indulge my ignorance a little more, there is another problem I
> would request advice for. I have run into a possible bottleneck in the
> learning algorithm, specifically where the simulator must compare the
> calculated error of the current learning cycle to a user-defined threshold
> value to determine if further learning is required. Currently I am storing
> this threshold value in a (1, 1) GPUArray and using the overloaded
> comparison operators to check it against the calculated network error, also
> stored on the GPU. The issue is that the code driving the simulator is all
> host-side: a conditional statement checks the result of the comparison and
> decides whether to continue working. Because a comparison of two GPUArrays
> returns another GPUArray holding a binary integer value, while a Python
> conditional needs that value on the host, I have no choice but to transfer
> a single binary integer from the device to the host - every single learning
> cycle. Due to the variety of operations the simulator must
> conduct each learning cycle, it would be unwieldy and, perhaps, impossible
> to use an if_positive(...) function to sidestep this issue. So, following
> all of that prologue, here is another question:
>
> Is it possible to write a custom kernel (or even a Python function) that
> can return integer values to the Python interpreter after evaluating GPU
> array data, without requiring the transfer of any of that data from the
> device to the host?
Yes, but only in a limited way. With enough mapping/unmapping logic,
device kernels can indeed write to host memory. However, I would
anticipate that the latency incurred in this process is similar to (if
not worse than) that of reading from the device.
Quite simply, if data resides on the device, the only way to get it off
of there is a read. Perhaps the best approach (and quite an easy one if I
understand your situation right) would be to continue the computation
(overlapped with the transfer) and defer the convergence check until the
transfer finishes. Here's an example of code that does this:
https://github.com/inducer/pycuda/blob/master/pycuda/sparse/cg.py
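In spirit, the pattern is something like this untested sketch (the names,
threshold, and cycle count are made up; the linked cg.py is more involved):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

stream = drv.Stream()

# page-locked host buffer to receive the scalar error value
err_host = drv.pagelocked_empty((1,), np.float64)

# stands in for the network error your kernels compute each cycle
err_gpu = gpuarray.to_gpu(np.array([1.0]))
threshold = 1e-3

for cycle in range(1000):
    # ... launch the kernels for learning cycle `cycle` here ...

    # start copying the error back without blocking the host
    drv.memcpy_dtoh_async(err_host, err_gpu.gpudata, stream)

    # ... queue more device work for the next cycle here ...

    stream.synchronize()          # wait only for the small copy
    if err_host[0] < threshold:   # convergence check, slightly deferred
        break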
Andreas
Hi Christian,
Christian Hacker <christian.b.hacker(a)gmail.com> writes:
> So my question is this: does referencing a GPUArray from within a numpy
> array of objects entail some kind of ungodly overhead, and is there a
> *good* way to store a "jagged" GPUArray?
FWIW, I use object arrays with GPUArrays in them all the time, and they
work just fine. One thing to note is that a separate kernel will be
launched to perform arithmetic on each of the sub-arrays. As a result,
if the sub-array size is small enough that kernel launch overhead is
comparable to the cost of the operation on the array, then you will
start seeing a performance impact. I would say that as soon as the size
of your sub-arrays is around 10,000 or so, you should be OK.
If your sub-arrays are smaller and you care about every last bit of
performance, you will likely need to roll a custom solution that stores
segment boundaries along with the array.
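For concreteness, a minimal sketch of what I mean (the shapes are made up):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

# one weight matrix per layer; shapes differ, so use an object array
layer_shapes = [(4, 8), (8, 3), (3, 1)]
weights = np.empty(len(layer_shapes), dtype=object)
for i, shape in enumerate(layer_shapes):
    weights[i] = gpuarray.to_gpu(np.random.rand(*shape))

# arithmetic on each sub-array launches one kernel per layer
scaled = np.empty_like(weights)
for i in range(len(weights)):
    scaled[i] = 2 * weights[i]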
Hope that helps,
Andreas
Greetings. I am developing a supervised learning neural network simulator
with complex weights (MLMVN) and am attempting to parallelize the
underlying linear algebra with pycuda. Thus far I've managed to implement
all required functions using the GPUArray class, the pycuda.cumath module,
and the scikits.cuda library without writing a single custom kernel, but
there's a significant caveat. Because the topology of the network (# of
layers, # of neurons per layer) is generated dynamically, the simulator
must be able to routinely create a variable number of 2d arrays (with
varying shapes) that contain the weights for each layer. Consequently,
what I need is an array of arrays, where each subarray has dimensions that
are specific to the layer it represents. If I were implementing this in
numpy, this would be trivial: create a 1d array with dtype=object and
shape=(# of layers, ), and then assign the 2d array of weights for each
layer to the corresponding element of the 1d array. This would be similar
to a Matlab cell object or a C# jagged array, as the internal dimensions of
the array are not all the same.

Because the pycuda.gpuarray class doesn't support element assignments,
creating a device-side container analogous to the described numpy array
isn't possible. I've tried constructing a "jagged" numpy array and simply
using gpuarray.to_gpu(numpy_array), but that does not work and seems to
"confuse" my graphics card. The only solution I've been able to find is to
allocate a 1d numpy array of objects as before, but then iteratively assign
separate GPUArrays as its elements to represent the weight arrays for each
layer. In other words, each element of the 1d numpy array is a pointer to a
GPUArray on the graphics card. There is significant overhead (~1 order of
magnitude) in accessing each GPUArray compared to accessing a numpy array
stored on the host machine, but I assumed that this would be a non-issue
since the host code wouldn't be modifying those GPUArrays anyway, just
passing them to the pycuda.cumath functions and gpuarray operators. This
assumption appears to be incorrect - the GPU simulator runs extremely
slowly, and its performance only deteriorates with increasing sizes of
learning sets. On the bright side, it can (and has) converge(d). My
conclusion is that the device and host are constantly swapping data during
the simulation, and I suspect my method for storing the weights of each
layer is to blame.

So my question is this: does referencing a GPUArray from within a numpy
array of objects entail some kind of ungodly overhead, and is there a
*good* way to store a "jagged" GPUArray? If anyone is willing to help me
through this issue, I will be grateful. Source code will be provided upon
request. Apologies for the length and the no-doubt plethora of mistakes
made in this posting.
CH
Hi!
I just installed pycuda on my system. I have Windows 8.1 and a GTX 970 in
my notebook, so I am using CUDA 7, the only CUDA version compatible with
my system.
I ran into 2 problems and I hope someone can help me.
1: If I enable CURAND, I cannot build pycuda. I always get 5 unresolved
external errors. Those externals are all related to CURAND, and I think
this might be because the CURAND library is not included in the CUDA Win32
lib folder.
2: Building without CURAND works, and installing also works fine.
I get some warnings regarding the deprecated NumPy API, but that does not
seem to be the problem.
When I try to run one of the example programs from the pycuda wiki (no
matter which), I get an error in pycuda\tools.py, line 42:
ImportError: No module named compyte.dtypes
I have no idea why this error occurs. Here is my siteconf.py; maybe it
helps identify the problem:
BOOST_INC_DIR = []
BOOST_LIB_DIR = []
BOOST_COMPILER = 'gcc43'
USE_SHIPPED_BOOST = True
BOOST_PYTHON_LIBNAME = ['boost_python-py27']
BOOST_THREAD_LIBNAME = ['boost_thread']
CUDA_TRACE = False
CUDA_ROOT = 'C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v7.0'
CUDA_ENABLE_GL = False
CUDA_ENABLE_CURAND = False
CUDADRV_LIB_DIR = ['${CUDA_ROOT}/lib/Win32', '${CUDA_ROOT}/lib/x64']
CUDADRV_LIBNAME = ['cuda']
CUDART_LIB_DIR = ['${CUDA_ROOT}/lib/Win32', '${CUDA_ROOT}/lib/x64']
CUDART_LIBNAME = ['cudart']
CURAND_LIB_DIR = ['${CUDA_ROOT}/lib/Win32', '${CUDA_ROOT}/lib/x64']
CURAND_LIBNAME = ['curand']
CXXFLAGS = ['/EHsc']
LDFLAGS = []
Thank you very much!
Jannes Nagel
Thanks for the suggestions! I tried the include-file option, and it works
nicely. For anyone who runs into the same problem later on, there is a
detailed answer here:
http://stackoverflow.com/questions/11290536/passing-a-c-cuda-class-to-pycud…
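In case the link rots: the gist of the include-file approach is to put the
shared __device__ functions in a header and point nvcc at it via
include_dirs, roughly like this (untested sketch; the file and function
names are made up):

import os
import pycuda.autoinit
from pycuda.compiler import SourceModule

# common_device.cuh (a hypothetical shared header) contains, e.g.:
#   __device__ float smooth(float x) { return 0.5f * (x + fabsf(x)); }

kernel_src = """
#include "common_device.cuh"

__global__ void apply_smooth(float *a, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        a[i] = smooth(a[i]);
}
"""

# include_dirs tells nvcc where to look for the shared header
mod = SourceModule(kernel_src, include_dirs=[os.getcwd()])
apply_smooth = mod.get_function("apply_smooth")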
On Mon, Mar 30, 2015 at 12:46 PM, Ananth Sridharan <ananth(a)umd.edu> wrote:
>
> Hi,
> I have a simulation code which requires the use of multiple kernels.
> Each of these kernels (global functions) needs to call a common set of
> device functions. To organize code better, I'd like to provide multiple
> source modules - one (or more) for the kernels, and one for the common
> dependencies.
>
> I'm missing the syntax (if it exists) to let the source module
> containing the kernels "know" the functions in the source module
> containing the device functions. Can someone help me out?
> (I'm a pyCuda novice, and have basic working knowledge of
> cuda-c/cuda-fortran)
>
> Ananth