Hi Loic,
Loic Gouarin <loic.gouarin(a)math.u-psud.fr> writes:
> thanks for the explanation and for the seq_dependencies flag.
>
> I am continuing to write my kernel and I see a behavior that I don't
> understand. I attach the Python script.
>
> I would like to store the result in the f_new array. If I create a
> temporary array the code works, but if I want to return the result in an
> f_new array with the same size as f, the code doesn't stop or does something
> that I don't understand (please uncomment the f_new lines in the kernel
> to see the problem).
>
> Could you tell me what happens and how I can debug the code?
Heh, turns out 'seq_dependencies=True' was a dumb idea after all. :)
The reason is that it created a bunch of (totally unnecessary)
write-after-write dependency edges between your writes to `f_new` that,
since the writes are to global memory, necessitate a global
barrier. Since OpenCL doesn't provide such a thing, the only way to do
that is to end the kernel and start a new one. A (new and currently
experimental) subsystem of Loopy then took off and split the kernel into
lots of tiny little pieces, adding spill-and-restore code for all the
data that the kernel had in local or private memory at that time. That
operation took a very long time and led to a poor user experience. We're
tracking that separately, here:
https://github.com/inducer/loopy/issues/35
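To make the dependency argument above concrete, here is a toy model in plain Python. It is not loopy's scheduler, and the instruction names (`get_k`, `set_k`) are hypothetical; it only counts how many dependency edges connect two writes to `f_new` under a strictly sequential ordering versus a grouped one, where each write depends only on the reads.

```python
# Toy model only: this does not use or imitate loopy's real scheduler.
N = 12  # number of components, matching the kernel discussed in this thread

# Each instruction is (id, stage). The ids are hypothetical names.
instructions = [(f"get_{k}", "get") for k in range(N)] + \
               [(f"set_{k}", "set") for k in range(N)]


def sequential_deps(insns):
    """Every instruction depends on the one written just before it."""
    return [(insns[n][0], insns[n - 1][0]) for n in range(1, len(insns))]


def grouped_deps(insns):
    """Every 'set' instruction depends on all 'get' instructions only."""
    gets = [name for name, stage in insns if stage == "get"]
    sets = [name for name, stage in insns if stage == "set"]
    return [(s, g) for s in sets for g in gets]


def write_after_write_edges(deps):
    """Keep only edges where both ends are writes to f_new."""
    return [(a, b) for a, b in deps
            if a.startswith("set_") and b.startswith("set_")]


seq_waw = write_after_write_edges(sequential_deps(instructions))
grp_waw = write_after_write_edges(grouped_deps(instructions))
print(len(seq_waw), len(grp_waw))
```

In this model the sequential ordering produces a chain of edges between the `f_new` writes, while the grouped ordering produces none, which is the situation the explanation above recommends.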
For your situation, a much better approach is to avoid introducing these
dependencies in the first place, since they're unnecessary. I've
attached a version of the code that does that, and I've also added an
FAQ item on how to do this:
https://documen.tician.de/loopy/misc.html#specifying-dependencies-for-group…
Full disclosure: There was a bug in how this was handled previously (now
fixed), so you'll need to update loopy in order for the attached code to
work.
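As a rough illustration of avoiding the unnecessary dependencies, the sketch below generates the instruction text for the "get f" and "set f" stages in plain Python. It assumes the `with {id_prefix=...,dep=...}` ... `end` group syntax described in loopy's documentation; the prefixes `get_f` and `set_f` and the helper `build_instructions` are hypothetical names, and the attached code may well differ. The string is only built here, not passed to `lp.make_kernel`, so it should be checked against the installed loopy version.

```python
def build_instructions(nv=12):
    """Build loopy-style instruction text for the 'get f' and 'set f' stages."""
    # Neighbor offsets repeat with period 4, as in the kernel in this thread.
    offsets = ["i-1, j", "i, j-1", "i+1, j", "i, j+1"]

    get_lines = []
    for k in range(nv):
        # Only the first assignment declares the temporary with '<>'.
        decl = "<> " if k == 0 else ""
        get_lines.append(f"    {decl}floc[{k}] = f[{offsets[k % 4]}, {k}]")

    set_lines = [f"    f_new[i, j, {k}] = floc[{k}]" for k in range(nv)]

    # Assumed group syntax: every write depends on all reads (dep=get_f*),
    # and there are no dependencies among the writes themselves.
    return "\n".join(
        ["with {id_prefix=get_f}"] + get_lines + ["end", ""]
        + ["with {id_prefix=set_f,dep=get_f*}"] + set_lines + ["end"]
    )


instructions = build_instructions()
print(instructions)
```

Generating the lines also avoids typing the twelve near-identical assignments by hand.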
> PS: I am not sending this email to the list because it's really specific to
> my problem and I will probably send you several emails to get working
> code.
Do keep the list cc'd. Even if your code is specific to your situation,
someone else may have a similar problem and might benefit from following
the discussion. That is, unless you're concerned about disclosing your
code. If that's a concern, you should let me know, since I've just
included a copy of your code in loopy's automated test suite. :)
Andreas

Hi Andreas,
I met you during your tutorial about loopy at Orsay in France. We worked
together on lattice Boltzmann methods using loopy.
I am trying to implement a first version and I have several questions.
Here is my first kernel (sorry, it is a little bit long):
LBM_kernel = lp.make_kernel(
"{[i,j]:1<=i<nx-1 and 1<=j<ny-1}",
"""
# get f
<> floc[0] = f[i-1, j, 0]
floc[1] = f[i, j-1, 1]
floc[2] = f[i+1, j, 2]
floc[3] = f[i, j+1, 3]
floc[4] = f[i-1, j, 4]
floc[5] = f[i, j-1, 5]
floc[6] = f[i+1, j, 6]
floc[7] = f[i, j+1, 7]
floc[8] = f[i-1, j, 8]
floc[9] = f[i, j-1, 9]
floc[10] = f[i+1, j, 10]
floc[11] = f[i, j+1, 11]
# f2m
<> m[0] = + floc[0] + floc[1] + floc[2] + floc[3]
m[1] = + 4.*floc[0] - 4.*floc[2]
m[2] = + 4.*floc[1] - 4.*floc[3]
m[3] = + floc[0] - floc[1] + floc[2] - floc[3]
m[4] = + floc[4] + floc[5] + floc[6] + floc[7]
m[5] = + 4.*floc[4] - 4.*floc[6]
m[6] = + 4.*floc[5] - 4.*floc[7]
m[7] = + floc[4] - floc[5] + floc[6] - floc[7]
m[8] = + floc[8] + floc[9] + floc[10] + floc[11]
m[9] = + 4.*floc[8] - 4.*floc[10]
m[10] = + 4.*floc[9] - 4.*floc[11]
m[11] = + floc[8] - floc[9] + floc[10] - floc[11]
# relaxation
# TODO
# m2f
# TODO
# set f
f_new[i, j, 0] = floc[0]
f_new[i, j, 1] = floc[1]
f_new[i, j, 2] = floc[2]
f_new[i, j, 3] = floc[3]
f_new[i, j, 4] = floc[4]
f_new[i, j, 5] = floc[5]
f_new[i, j, 6] = floc[6]
f_new[i, j, 7] = floc[7]
f_new[i, j, 8] = floc[8]
f_new[i, j, 9] = floc[9]
f_new[i, j, 10] = floc[10]
f_new[i, j, 11] = floc[11]
""",
[
lp.GlobalArg("f_new", shape="nx, ny, nv"),
lp.GlobalArg("f", shape="nx, ny, nv"),
lp.ValueArg("nx", np.int32),
lp.ValueArg("ny", np.int32),
lp.ValueArg("nv", np.int32),
]
)
When I generate the OpenCL code, the instruction ordering is not preserved, as
mentioned in the doc. I see in the documentation that we can set id and
dep flags, but is it possible to use these flags when you have several
lines? In my example, I would like to have the following order:
- get f
- f2m
- set f
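For checking the generated code against the intended order, a pure-Python reference of the three stages can help. This is only a sketch: the function names `f2m` and `lbm_step_reference` are hypothetical, arrays are nested lists indexed as `f[i][j][k]`, and the moments are computed but not used because the relaxation and m2f steps are still marked TODO in the kernel.

```python
NV = 12
# Neighbor shifts (di, dj) repeat with period 4, following the kernel above.
SHIFTS = [(-1, 0), (0, -1), (1, 0), (0, 1)]


def f2m(floc):
    """Moments for each block of four distributions, as in the '# f2m' stage."""
    m = []
    for b in range(0, NV, 4):
        a0, a1, a2, a3 = floc[b:b + 4]
        m += [a0 + a1 + a2 + a3,
              4. * a0 - 4. * a2,
              4. * a1 - 4. * a3,
              a0 - a1 + a2 - a3]
    return m


def lbm_step_reference(f, nx, ny):
    """Apply get f, f2m, set f on interior points and return a new array."""
    f_new = [[[0.0] * NV for _ in range(ny)] for _ in range(nx)]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            # get f: gather each component from its shifted neighbor
            floc = [f[i + SHIFTS[k % 4][0]][j + SHIFTS[k % 4][1]][k]
                    for k in range(NV)]
            # f2m: unused until relaxation and m2f are filled in
            m = f2m(floc)
            # set f: write to a separate output array, never back into f
            for k in range(NV):
                f_new[i][j][k] = floc[k]
    return f_new


nx, ny = 4, 4
f = [[[100.0 * i + 10.0 * j + k for k in range(NV)]
      for j in range(ny)] for i in range(nx)]
f_new = lbm_step_reference(f, nx, ny)
print(f_new[1][1][:4])
```

Because every read comes from `f` and every write goes to `f_new`, the result does not depend on the order in which the interior points are visited.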
My other question concerns calling kernels inside a kernel. I
didn't see any examples in the documentation that can do that. Is it
possible, for example, to create a function get_f and to call that
function from LBM_kernel? If so, is it also possible to
have it inlined in the generated code?
Thanks,
Loic
--
Tel: 01 69 15 60 14
http://www.math.u-psud.fr/~gouarin
https://github.com/gouarin