Thanks as always for your helpful comments!
> First off, an unrelated point: You really shouldn't ever modify loopy
> kernels in-place, such as by assigning to instruction attributes.
What would be the correct way of modifying the kernel such that it is
schedulable (without the in-place instruction attribute modification)?
Creating a copy with the modified instructions?
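For instance, would something along these lines be the sanctioned
approach? (The priority change is just a stand-in for whatever
attribute I would otherwise have assigned to in place.)

    import loopy as lp

    knl = lp.make_kernel(
        "{[i]: 0 <= i < n}",
        "out[i] = 2*a[i]")

    # rebuild the instruction list instead of mutating it, then make
    # a modified copy of the kernel
    new_insns = [insn.copy(priority=5) for insn in knl.instructions]
    knl = knl.copy(instructions=new_insns)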
From: Andreas Kloeckner [mailto:firstname.lastname@example.org]
Sent: Friday, February 3, 2017 8:43 PM
To: Nick Curtis <nicholas.curtis(a)uconn.edu>; loopy(a)tiker.net
Subject: Re: [Loopy] Question on reductions
Nick Curtis <nicholas.curtis(a)uconn.edu> writes:
Rather than continue to clutter up the GitHub repo (issue #81) with
what is likely my own lack of understanding of reductions, I thought
I'd move my question to the mailing list.
I have a kernel that functions similarly to the one below, and I am
wondering whether the sum reduction can be parallelized over 'i', that
is, by splitting 'i' and tagging the inner iname as 'l.0'.
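(A reduced sketch of what I'm doing; the names and the split size are
illustrative rather than my actual code:)

    import loopy as lp

    knl = lp.make_kernel(
        "{[i, j]: 0 <= i < n and 0 <= j < m}",
        "a[0, j] = sum(i, b[i, j])")

    # split the reduction iname and try to run the inner part across
    # the local hardware axis
    knl = lp.split_iname(knl, "i", 16)
    knl = lp.tag_inames(knl, {"i_inner": "l.0"})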
After some fiddling, I have managed to get the kernel schedulable;
however, I am now encountering the error "instruction 'sum_i_outer_init'
does not use all local hw axes".
I believe this is most likely because the final assignment to
`a[0, j]` does not vary with 'i' (and hence with the local hardware axis).
Is this reduction possible? And is it even advisable? I'm guessing it
only pays off at much larger sizes of 'i', but it would be an option
that could be turned on as the problem size grows.
First off, an unrelated point: You really shouldn't ever modify loopy
kernels in-place, such as by assigning to instruction attributes.
Essentially all of loopy assumes that kernels are immutable, e.g. for
caching. Mutating them in place is bound to end in tears.
Next, what you're encountering is more a restriction of GPU programming
in general than one of loopy. The problem is that an entire kernel
needs to have a single work group size. In your example, while the reduction
over i_inner is parallel (and uses all elements of the work group), the
initialization of the i_outer reduction does not use all work group elements
(in loopy-speak: does not have an iname tagged l.0), and that's what the
complaint is about.
In your case, parallelizing over j is more promising: parallel
reduction should be a last resort, used only once all other sources of
parallelism are exhausted, because of its (by comparison) low
efficiency.
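Concretely, something like this (untested, reusing the inames from
your sketch; the group size of 128 is arbitrary):

    knl = lp.split_iname(knl, "j", 128,
            outer_tag="g.0", inner_tag="l.0")
    # the reduction over i then runs sequentially within each work item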
Hope that helps,