Yiming Peng <ypeng(a)u.northwestern.edu> writes:
> Hi Andreas,
>
> I am a former student of your CS 450 and now an incoming PhD student in
> operations research at Northwestern.
>
> Since I am interested in applying parallel computing, preferably using
> Python, to my future research, I have been looking for software that
> combines Python with CUDA. I found PyCUDA on your website, and also
> NumbaPro. It seems that these two are the most popular choices for people
> with needs like mine.
>
> So my question is: which one should I learn and use first? Could you
> give some comments on the pros and cons of the two?
Cc'ing the PyCUDA list for archival/searchability.
- PyCUDA lets you/forces you to write CUDA C for your kernels.
- Numba lets you write (a narrow subset of) Python for your kernels,
including arrays I believe.
- The code you write for both will be roughly equivalent modulo
  spelling, since you'll have to express the same kernel logic in
  either case.
- PyCUDA exposes (nearly) the entire CUDA runtime, including streams,
profiling, textures, ... Numba is more restricted.
- PyCUDA comes with an on-device array type. I'm not sure if Numba's
arrays stay on-device after the computation finishes--i.e. you may
have some implicit copying.
- PyCUDA comes with some pre-made parallel algorithms such as scans
  and reductions (see the sketch after this list).
- You may also want to take a look at
- https://documen.tician.de/pyopencl/
- https://documen.tician.de/loopy/
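
To give a taste of the gpuarray/reduction point above, here is a minimal
sketch (illustrative names and sizes; assumes pycuda.autoinit for context
creation):

    import numpy as np
    import pycuda.autoinit  # creates a context on the default device
    import pycuda.gpuarray as gpuarray
    from pycuda.reduction import ReductionKernel

    a = gpuarray.to_gpu(np.random.randn(10**6).astype(np.float32))
    b = 2 * a + 1  # elementwise arithmetic stays on the device

    # a dot product expressed as a map + reduce
    dot = ReductionKernel(np.float32, neutral="0",
                          reduce_expr="a+b", map_expr="x[i]*y[i]",
                          arguments="float *x, float *y")
    print(dot(a, a).get())        # .get() copies the result back to the host
    print(gpuarray.sum(b).get())
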
Hope that helps,
Andreas
Irving Enrique Reyna Nolasco <irvingenrique.reynanolasco(a)kaust.edu.sa>
writes:
> I am a student in physics. I am pretty new
> to PyCUDA. Currently I am interested in finite volume methods running on
> multiple GPUs in a single node. I have not found relevant documentation
> on this issue, specifically how to communicate between different contexts
> or how to run the same kernel on different devices at the same time.
> Would you suggest some literature/documentation about that?
I think the common approach is to have multiple (CPU) threads and have
each thread manage one GPU. Less common (but also possible, if
cumbersome) is to only use one thread and switch contexts. (FWIW,
(Py)OpenCL makes it much easier to talk to multiple devices from a
single thread.)
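
A rough sketch of the thread-per-GPU pattern (untested; the actual kernel
work is left as a placeholder):

    import threading
    import numpy as np
    import pycuda.driver as cuda

    cuda.init()

    def worker(device_index, data):
        ctx = cuda.Device(device_index).make_context()  # current for this thread
        try:
            buf = cuda.mem_alloc(data.nbytes)
            cuda.memcpy_htod(buf, data)
            # ... build a SourceModule and launch kernels here ...
            cuda.memcpy_dtoh(data, buf)
        finally:
            ctx.pop()  # detach the context before the thread exits

    threads = [threading.Thread(target=worker,
                                args=(i, np.ones(1024, np.float32)))
               for i in range(cuda.Device.count())]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
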
Lastly, if you're thinking of scaling up, you could just have one MPI
rank per device.
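
With mpi4py, the per-rank device selection could look something like this
(a sketch, assuming one process per GPU on the node):

    from mpi4py import MPI
    import pycuda.driver as cuda

    cuda.init()
    rank = MPI.COMM_WORLD.Get_rank()
    ctx = cuda.Device(rank % cuda.Device.count()).make_context()
    try:
        pass  # each rank allocates, launches, and exchanges halo data via MPI
    finally:
        ctx.pop()
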
Hope that helps,
Andreas
Hi Chris,
Not sure what you're asking. The code you show doesn't apply--it uses
the 'runtime API' (cudaXyz...), PyCUDA uses the 'driver API'
(cuXyz...). And the piece of Peter's example that worries about
exchanging data with PyCUDA (lines 162-192) is about the same in
complexity as what you're showing.
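For reference, a rough sketch of the driver-API equivalents in pycuda.gl
(untested; assumes a GL context is current, `pbo` is an existing GL pixel
buffer object handle, `kernel` is a compiled PyCUDA function, and W, H are
the image dimensions):

    import numpy as np
    import pycuda.driver as cuda
    import pycuda.gl

    cuda.init()
    ctx = pycuda.gl.make_context(cuda.Device(0))    # GL-enabled CUDA context

    reg_pbo = pycuda.gl.RegisteredBuffer(int(pbo))  # ~ cudaGraphicsGLRegisterBuffer

    def render():
        mapping = reg_pbo.map()                     # ~ cudaGraphicsMapResources
        dev_ptr, size = mapping.device_ptr_and_size()  # ~ ...GetMappedPointer
        kernel(np.intp(dev_ptr), np.int32(W), np.int32(H),
               block=(16, 16, 1), grid=(W // 16, H // 16))
        mapping.unmap()                             # ~ cudaGraphicsUnmapResources
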
Andreas
Chris Uchytil <uchytilc(a)uw.edu> writes:
> I am brand new to CUDA and OpenGL and I have found that tutorials and
> resources on a lot of this material are rather scarce or not
> straightforward, so I am hoping I can get some assistance here. I am
> working on a project attempting to convert some CUDA and OpenGL C++ code
> over to Python. The code is a basic kernel that computes distance from a
> point (to emulate light on a wall from a flashlight) and sends the
> calculated array to OpenGL to display the light intensity. You can move
> your mouse/"flashlight" around to move the light around on the screen. I
> have been successful in converting the kernel code over to Python using
> the Numba package. What I am having trouble with is the OpenGL
> interoperability stuff. I can't really find any info that describes the
> process of interop in a simple fashion, so I'm not really even sure what
> the setup process is. It sounds like you need to create something called
> a pixel buffer and send that to the kernel. From what I can tell the C++
> code uses this simple function to do this.
>
> // texture and pixel objects
> GLuint pbo = 0;
> GLuint tex = 0;
> struct cudaGraphicsResource *cuda_pbo_resource;
>
> void render() {
>   uchar4 *d_out = 0;
>   cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
>   cudaGraphicsResourceGetMappedPointer((void **)&d_out, NULL,
>                                        cuda_pbo_resource);
>   kernelLauncher(d_out, W, H, loc);
>   cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);
> }
>
>
> I can't find any info that describes the Python equivalent of
> cudaGraphicsMapResources, cudaGraphicsResourceGetMappedPointer, and
> cudaGraphicsUnmapResources. I've found a GL interop example by Peter
> Berrington (https://wiki.tiker.net/PyCuda/Examples/GlInterop) but it seems
> to me to be overly complicated in how it creates PBOs and textures and
> such when compared to the C++ code.