Since ElementwiseKernels usually fetch each array entry
exactly once, there likely isn't much in the way of
savings to be had.
Are you sure? In this set of applications, each array entry is read 4 times, minimum.
This is a stencil computation for a 2D finite-difference method. The code is up and
running, and a CPU version was used to check the results.
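For context, a CPU check for a 5-point finite-difference stencil like the one described might look roughly like this (a sketch only; the 5-point Laplacian form, grid shape, and boundary handling are assumptions, since the actual code isn't shown):

```python
import numpy as np

def laplacian_5pt(u):
    """5-point finite-difference Laplacian on the interior of a 2D grid.

    Each interior entry of u is read by the stencils of its four
    neighbors as well as its own, which is exactly the redundant
    fetching that shared memory is meant to reduce.
    """
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1]
                       + u[1:-1, :-2] + u[1:-1, 2:]
                       - 4.0 * u[1:-1, 1:-1])
    return out

u = np.arange(16.0).reshape(4, 4)
print(laplacian_5pt(u))
```

Applied to a linear field like the one above, the Laplacian vanishes, which makes for an easy sanity check against a GPU version.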
If the abstraction must fall by the wayside, then it will fall. The results come first. I
don't care about the beauty of ElementwiseKernel for its own sake, although I can
understand if you do (and I don't know that you do). Incremental change is the best way
forward. Scrapping it and starting over with SourceModule, or worse, pure CUDA C, is
possibly unnecessary just to get access to a little bit of shared memory today.
What I probably failed to mention is that I'm not writing a library module for
someone else to enjoy reading and using. Don't be afraid of breaking beautiful
abstractions to make an application program more efficient if, at the end of the day,
there is a reason for it. I will be the judge of that.
Thanks for further insight on shared memory access inside an ElementwiseKernel.
From: Andreas Kloeckner <lists(a)informa.tiker.net>
To: Geoffrey Anderson <mrcoder(a)yahoo.com>; "pycuda(a)tiker.net"
Sent: Sunday, April 21, 2013 6:19 PM
Subject: Re: [PyCUDA] shared memory as next step in performance with ElementwiseKernel
Geoffrey Anderson <mrcoder(a)yahoo.com> writes:
So I've got this program using Elementwise and I want to up the
performance one more level. Nobody to my knowledge has written about
using shared memory, but that does not mean it can't be done in an
Elementwise program. How can shared memory be used in an elementwise
program without completely rewriting the thing as SourceModule? That
is, how to get an incremental improvement in my existing
ElementwiseKernel program, with the least code change?
I suspect shared memory is the key. I have lots of array work in my
program, naturally. To use shared memory, I imagine that the program
would need to detect how many i's per block there are, because shared
memory is block scoped (by i I mean the magic i that's passed in by
the pycuda system to an ElementwiseKernel), and this value would be
used as the size of the array of shared memory to be allocated. I'm
also not sure which thread should allocate the memory; probably only
one thread per block should do this but I don't know how that could be
achieved. Is the thread with index 0 in x the key here? And
how would an ElementwiseKernel reference that x value?
Three comments on this:
- I feel like shared memory isn't a good fit for the abstraction
presented by ElementwiseKernel, which deliberately hides various
details from you, including the thread block size being used. Since
you really need to know about thread blocks to make use of shared
memory, including it would make the abstraction (more) leaky.
Not a good thing.
- ElementwiseKernel really isn't magic. :) All it does is paste your
code into this here:
and then run the resulting kernel with a thread block size computed by
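The generated source is roughly shaped like the following (a sketch reconstructed from memory of PyCUDA's elementwise module, not the exact code; the placeholder names and the example substitutions are assumptions):

```python
# Sketch: ElementwiseKernel pastes the user's "operation" snippet
# into a grid-stride loop over the flat index i.
ELWISE_TEMPLATE = """
__global__ void %(name)s(%(arguments)s)
{
  unsigned tid = threadIdx.x;
  unsigned total_threads = gridDim.x * blockDim.x;
  unsigned cta_start = blockDim.x * blockIdx.x;

  for (unsigned i = cta_start + tid; i < n; i += total_threads)
  {
    %(operation)s;
  }
}
"""

# Hypothetical example: what an axpy-style operation would expand to.
src = ELWISE_TEMPLATE % {
    "name": "axpy",
    "arguments": "float a, float *x, float *y, unsigned n",
    "operation": "y[i] = a * x[i] + y[i]",
}
print(src)
```

The point being: the "magic" i is just an ordinary loop variable, and the thread block size is chosen by PyCUDA at launch time, outside the user's snippet.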
- If your code fits into ElementwiseKernel, then I'm not sure you'll see
much gain from using shared memory. Shared memory is good to help
avoid redundant fetches. Since ElementwiseKernels usually fetch each
array entry exactly once, there likely isn't much in the way of
savings to be had.
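Whether that holds depends on the operation: a pure map (y[i] = f(x[i])) really does fetch each entry once, but a stencil body that also indexes x[i-1], x[i+1], and so on fetches neighbors redundantly. A small count makes the difference concrete (a sketch; a 1D 3-point stencil is used for simplicity):

```python
import numpy as np

def count_reads_3pt(n):
    """Count how many times each entry of a length-n array is read
    when every interior index i evaluates x[i-1] + x[i] + x[i+1]."""
    reads = np.zeros(n, dtype=int)
    for i in range(1, n - 1):
        reads[i - 1] += 1
        reads[i] += 1
        reads[i + 1] += 1
    return reads

print(count_reads_3pt(8))  # interior entries are each read 3 times
```

For a 2D 5-point stencil the interior count rises to 5 reads per entry, which is the redundancy the original poster is hoping shared memory can absorb.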