On Fri, Apr 25, 2014 at 4:24 AM, Max Argus <argus.max(a)gmail.com> wrote:
> For my code, indexing to get coherent memory access is important;
> for this I would (if possible) like to have abstract objects that
> handle it abstractly. Let's say that for a reduction in which the
> order is unimportant, I want to be able to write "for index in
> orange(array)" in such a way that orange is a Python object that
> gives me plain linear indices, so the code still works as ordinary
> Python, but during optimization/translation it can be replaced by
> more elaborate options.
> (1) How were you planning to deal with these things?
I was not really planning anything more complicated than just a
translation from n-dimensional array indices to a flat pointer to GPU
memory. That is, I was only hoping to achieve a more convenient way to
write kernels, in Python instead of a Mako + C mixture I use now in
Reikna. The optimizations you are doing seem interesting, but I think
they are quite separate from the translation process itself.
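For illustration only (this is not part of peval or Reikna), the "orange" abstraction from the question could be modeled as a plain Python iterable that yields linear indices by default, so the code runs unmodified in ordinary Python, while a translator is free to substitute a different traversal order for an order-independent reduction. All names here are hypothetical:

```python
class orange:
    """Yields the flat indices of a sequence.

    In plain Python this is just linear order; during translation it
    could be replaced by a different traversal (e.g. strided, for
    coherent GPU memory access), since the reduction below does not
    depend on the order of iteration.
    """
    def __init__(self, seq):
        self._n = len(seq)

    def __iter__(self):
        return iter(range(self._n))

# An order-independent reduction written against the abstract iterator.
data = [1.0, 2.0, 3.0, 4.0]
total = 0.0
for index in orange(data):
    total += data[index]

print(total)  # 10.0
```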
> If you plan to just always unroll everything (I don't know if this
> is desirable), such things might need to happen before peval. By the
> way, how do I get peval to do that? At the moment I do:
Currently there is no unrolling functionality (as I mentioned earlier,
I'm currently making some architecture changes to simplify the addition
of new features). I was planning something along the lines of:
1) automatic unrolling (the heuristic is yet unclear);
2) forced unrolling: recognizing code like "for i in unroll(range(n)):".
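As a sketch of option 2 (hypothetical; the actual API is not decided), unroll() could be a no-op identity wrapper at run time, serving only as a marker that the partial evaluator recognizes by name in the AST:

```python
def unroll(iterable):
    # No-op at run time; only a marker. A partial evaluator that
    # knows n at specialization time could recognize this call in
    # the AST and replace the loop with n copies of its body.
    return iterable

def sum_first_n(xs, n):
    total = 0
    for i in unroll(range(n)):
        total += xs[i]
    return total

# With n fixed at 3, unrolling would turn the loop into:
#   total += xs[0]; total += xs[1]; total += xs[2]
print(sum_first_n([5, 6, 7, 100], 3))  # 18
```

Since unroll() changes nothing at run time, the annotated code stays runnable as plain Python even when no specialization happens.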
> (2) I quite liked my decorators to specify what is [...]. Would it
> be possible to preserve this interface? peval should take the outer
> decorators and ignore those that it doesn't know.
Peval preserves all the decorators at the moment (at least it should).
There are some caveats with decorators though (due to the way peval
discovers the function code); see:
> (3) add_spec.replace("__binding_1", "range(100)"): how do I get
> peval to unroll my loop for me?
See above. Currently this feature is not implemented.
> (4) It might be good to be able to skip the parsing step;
> partial_apply(...).getAST() or something similar would be helpful.
partial_apply() returns a normal callable function, with the proper
signature, globals, closure and so on, so adding some getAST() method
is not really desirable. The problem here is that the source of this
function cannot be discovered by inspect.getsource(), and therefore
you cannot parse it and get its AST. What can be done is exposing the
internal Function class which, in addition to encapsulating some
Python magic of extracting global and closure variables, knows where
to look for the source of the function object constructed by peval. It
has a 'tree' attribute containing the AST.
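For an ordinary, source-discoverable function the same information can be obtained with the standard ast module; peval's internal Function class would additionally know where to find the source of the functions peval itself constructs (which inspect.getsource() cannot see). A rough model of what a 'tree' attribute would hold:

```python
import ast

# Source of an ordinary function. inspect.getsource() would recover
# this only if the function lives in a file on disk, which is exactly
# what fails for functions constructed by peval.
source = """
def add(a, b):
    return a + b
"""

# What a 'tree' attribute would expose: the parsed AST.
tree = ast.parse(source)
func_def = tree.body[0]

print(type(func_def).__name__)  # FunctionDef
print(func_def.name)            # add
```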
> (5) Would the threaded fenced reduction (CUDA sample) be a good
> program to demo eval->translate with? It seems to have templates
> and indexing for coherent memory access, and to be relatively
> fundamental/important. Plus, it already has optimized C++ CUDA
> code for comparison.
Yes, I think it will be a very good example. It can also demonstrate
passing and using a custom predicate (as another GPU function) and
working with arbitrary structures instead of integers/floats.
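A pure-Python model of such a reduction with a custom pairwise operation working on structures rather than plain numbers (names are illustrative, not peval's API); because the operation is associative, a translator would be free to map this sequential reference onto a tree-shaped GPU reduction like the CUDA sample:

```python
def reduce_with(op, data, init):
    # Sequential reference implementation. When op is associative,
    # the result does not depend on grouping, so a GPU translation
    # may evaluate it as a parallel tree reduction instead.
    acc = init
    for x in data:
        acc = op(acc, x)
    return acc

# A custom operation on structures instead of integers/floats:
# keep the (value, index) pair with the largest value,
# preferring the earlier index on ties.
def argmax_op(a, b):
    return a if a[0] >= b[0] else b

values = [3.0, 7.0, 1.0, 7.0, 2.0]
pairs = list(zip(values, range(len(values))))
best = reduce_with(argmax_op, pairs, (float("-inf"), -1))

print(best)  # (7.0, 1)
```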