Your work is interesting! At the moment I have a half-baked version of
partial evaluation inside the translator, which only handles the
@optimizes variables. A conceptually cleaner framework would be to
separate the two: it would help if peval, the translator and codepy
overlapped as little as possible in the work they do, which means
most optimization should be done in peval.
For my code, indexing that gives coherent memory access is the most
important thing, and for this I would (if possible) like to have
abstract objects that handle it abstractly. Let's say that for a
reduction in which the order is unimportant, I want to be able to
write "for index in orange(array)" in such a way that orange is a
Python object that yields a boring linear index sequence (so the code
still works as plain Python), but during optimization/translation it
can be replaced by more elaborate options.
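For illustration, such an object could be sketched in plain Python like
this (the name orange comes from the idea above; the data and the
reduction are just hypothetical examples of the interface):

```python
class orange:
    """Order-agnostic index range over an array-like object.

    In plain Python it just iterates linearly, so code using it keeps
    working unmodified; a translator would be free to substitute a
    fancier access pattern (tiled, strided, ...) because the order is
    declared irrelevant.
    """
    def __init__(self, array):
        self.length = len(array)

    def __iter__(self):
        # Fallback behavior: boring linear iteration.
        return iter(range(self.length))

data = [10, 20, 30, 40]
total = 0
for index in orange(data):
    total += data[index]
# Order-independent reduction: the result is the same for any
# iteration order a translator might pick.
print(total)
```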
(1) How were you planning to deal with these things?
If you plan to just always unroll everything (I don't know if this is
desirable), such things might need to happen before peval. By the way,
how do I get peval to do that? At the moment I do:
def add(tgt: Pointer(numpy.float32), op1: Pointer(numpy.float32),
        op2: Pointer(numpy.float32)):
    idx = threadIdx.x + thread_strides * block_size * blockIdx.x
    for i in range(block_size):
        tgt[idx + i*thread_strides] = (op1[idx + i*thread_strides]
                                       + op2[idx + i*thread_strides])

#print(py2codepy(add, block_size=256, thread_strides=16))
add_spec = partial_apply(add, block_size=256, thread_strides=16)._peval_source
add_spec = add_spec.replace("__binding_1", "range(100)")
print(py2codepy(add_spec))
(2) I quite liked my decorators for specifying what is being
optimized. Would it be possible to preserve this interface? peval
should consume the outer decorators it knows about and ignore those it
doesn't.
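To make the idea concrete, here is a minimal sketch of what such a
decorator interface could look like (the name optimizes and the
_optimizes attribute are hypothetical, not peval's actual API): the
decorator only attaches metadata, so any tool that doesn't know it
just sees an ordinary function.

```python
def optimizes(**params):
    """Hypothetical decorator: record which variables should be
    specialized, without changing runtime behavior."""
    def wrap(fn):
        # Merge with metadata from any inner optimizes decorators.
        fn._optimizes = dict(getattr(fn, "_optimizes", {}), **params)
        return fn
    return wrap

@optimizes(block_size=256)
@optimizes(thread_strides=16)
def add(tgt, op1, op2):
    ...

# The metadata is available for a translator that looks for it,
# and invisible to everything else.
print(add._optimizes)  # {'thread_strides': 16, 'block_size': 256}
```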
(3) Regarding add_spec.replace("__binding_1","range(100)") above: how
do I get peval to unroll my loop for me?
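For reference, here is roughly what unrolling a fixed-range loop looks
like as a standard-library AST transform. This is a sketch of the
mechanism only (not peval's actual API): it handles the single-argument
`range(N)` form with a literal N and substitutes the literal index into
each copy of the body.

```python
import ast, copy

class _Subst(ast.NodeTransformer):
    """Replace reads of one variable name with a literal constant."""
    def __init__(self, name, value):
        self.name, self.value = name, value
    def visit_Name(self, node):
        if node.id == self.name and isinstance(node.ctx, ast.Load):
            return ast.copy_location(ast.Constant(self.value), node)
        return node

class Unroll(ast.NodeTransformer):
    """Unroll `for i in range(N)` when N is a literal constant."""
    def visit_For(self, node):
        self.generic_visit(node)
        it = node.iter
        if (isinstance(it, ast.Call) and isinstance(it.func, ast.Name)
                and it.func.id == "range" and len(it.args) == 1
                and isinstance(it.args[0], ast.Constant)):
            var = node.target.id
            body = []
            for i in range(it.args[0].value):
                for stmt in node.body:
                    # One copy of the body per iteration, with the
                    # loop variable replaced by the literal index.
                    body.append(_Subst(var, i).visit(copy.deepcopy(stmt)))
            return body  # replaces the For node with the flat body
        return node

src = ("def f(a):\n"
       "    s = 0\n"
       "    for i in range(3):\n"
       "        s = s + a[i]\n"
       "    return s\n")
tree = ast.fix_missing_locations(Unroll().visit(ast.parse(src)))
print(ast.unparse(tree))  # the loop is gone, three statements remain
```

(`ast.unparse` needs Python 3.9+; peval would presumably expose this as
an evaluator option rather than a manual pass.)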
(4) It might be good to be able to skip the parsing step. Something
like partial_apply(...).getAST() would be helpful.
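The point of such a getAST() (a proposed name, not an existing peval
call) would be that an AST can be transformed and compiled directly,
with no unparse-then-reparse round trip through source text; a minimal
sketch of working at the AST level:

```python
import ast

# Here the AST comes from parsing a string, but a hypothetical
# partial_apply(...).getAST() could hand one over directly.
tree = ast.parse("def add2(x):\n    return x + 2\n")

# Transform passes would go here; then compile the tree as-is.
code = compile(ast.fix_missing_locations(tree), "<ast>", "exec")
ns = {}
exec(code, ns)
print(ns["add2"](40))  # 42
```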
(5) Would the threaded fenced reduction (CUDA sample) be a good
program to demo eval->translate with? It seems to have templates and
indexing for coherent memory access, and it is relatively
fundamental/important. Plus it already has optimized C++ CUDA code for
comparison.
On Wed, Apr 23, 2014 at 11:26 PM, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
There are two similar projects I know of:
(Python to C/OpenCL/JS translator)
(Python to CUDA translator/compiler)
Both of them seem to be somewhat abandoned.
As a matter of fact, I've been tentatively working on a similar
project myself recently, although I decided to start by implementing
a partial evaluation library (a big rework of an existing abandoned
project, to be accurate) to serve as a replacement for templates. The
translator would be the next step, once I had enough control over the
partial evaluation stage (e.g. could tell the evaluator to inline
Python functions, unroll loops and so on). My main aim was to be able
to rewrite Reikna templates completely in Python (including
non-trivial template flow, template functions and macros they have in
The partial evaluator is located at https://github.com/Manticore/peval
. It is currently in working condition, but there's a big change
incoming that will allow me to separate different optimization
strategies it employs and make the control flow more predictable.
After that I expect I will be able to make a prototype of a
translator. The partial evaluation and translation stages are mostly
independent, so you may find it useful for your project too. Although
it would be nice to join our efforts on this task if you plan to
develop your translator further.
On Thu, Apr 24, 2014 at 6:40 AM, Max Argus <argus.max(a)gmail.com> wrote:
> I am looking for a Python-based metaprogramming library that will
> allow me to tune programs using Python and then export the optimal
> CUDA (for now) code to a file so that it can be used from a C++
> program.
> 1) What is the best choice for this?
> The metaprogramming systems associated with PyCUDA are
> template-based systems or codepy. Codepy seemed to be recommended by the
> However, the problem I encountered was that the codepy
> representation of code was only marginally understandable. Because I
> wasn't too keen on converting a bigger kernel into this format, I had
> a look around. I noticed that the codepy classes were very similar to
> the Python AST, for obvious reasons.
> Based on this I wrote an (incomplete) AST transformer that converted
> a Python implementation of the codepy demo program into the codepy
> format. This is more or less easy enough for the simple program
> provided.
> However, the question is whether this is still a viable way to do
> things when kernels get more complicated. The Python AST transformer
> will allow one to do pretty much every possible code transformation;
> in the end it probably comes down to whether it will be possible to
> add all of the Python-to-C translation hints/options (e.g. loop
> unrolling etc.) in a way that keeps the code legal Python and
> readable.
> At the moment I think this is feasible, since pretty much everything
> can be wrapped in (or preceded by) an (empty) function whose name can
> be used to influence the AST transformer's behavior.
> Next I will try to implement a 2D convolution.
> 2) Is there a frontend script for codepy that compiles and evaluates
> the kernel performance?
> At the moment this seems like the most pythonic way to do
> metaprogramming to me in terms of readability and flexibility, though
> not in ease of implementation.
> 3) In general, does this approach seem sane to you guys, and if so,
> what should it look like in order to be useful to others too?
> BR, Max
> PyCUDA mailing list