trying to fix a bottleneck in my code
by Guilherme Gonçalves Ferrari

Dear all,
I am working on an O(N^2) N-body code. I decided to write everything in
python/numpy and to use pyopencl only to accelerate the gravitational force
calculation.
In a naive approach to the time integration of the equations of motion, one
could simply use a constant time-step for all particles. The CL kernel for
the force calculation is then always called with the same (total) number of
particles, namely N. This works fine since, for reasonable values of N
(>10^3), the calculation cost is much larger than the communication cost.
However, if one wants to use a smarter approach to the time integration of
the system, one needs some sort of individual time-step scheme (in my case,
a power-of-2 block time-step scheme). In this case I need to call the CL
kernel with a different number of particles N_L (N_L <= N) whenever a
subset of particles at a given time-step level L needs to be integrated.
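To make the scheme concrete, here is a minimal numpy sketch of how the active subset could be selected at a block step (the names 'level' and 'active_at_level' are just for illustration, not my actual code):

```python
import numpy as np

# Illustrative only: each particle carries a power-of-2 time-step
# dt_i = dt_max / 2**level_i, stored here as an integer level.
rng = np.random.default_rng(0)
n = 16
level = rng.integers(0, 3, size=n)  # levels 0, 1 or 2

def active_at_level(level, L):
    """Indices of particles to integrate at block level L.

    Particles at deeper levels (smaller dt) are integrated more often,
    so at level L every particle with level >= L is active.
    """
    return np.flatnonzero(level >= L)

idx = active_at_level(level, 2)
n_L = idx.size  # this is the N_L passed to the CL kernel
```

The point is that n_L changes from step to step, which is why the kernel cannot always be launched with the full N.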
So, after making sure that everything works properly, I am now trying to
fix some bottlenecks in my implementation. In the function for force
calculation I do something like this:
...
acc_kernel = kernel_library.get_kernel("p2p_acc_kernel")
acc_kernel.set_kernel_args(*data, global_size=global_size,
local_size=local_size,
result_shape=result_shape,
local_memory_shape=local_memory_shape)
acc_kernel.run()
result = acc_kernel.get_result()
...
where 'data' is a tuple of numpy arrays.
After some profiling I have figured out that the main bottleneck is in the
'set_kernel_args' method, where I have two calls to
cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=hostbuf)
and one call to
cl.Buffer(ctx, mf.WRITE_ONLY, size=kernel_result.nbytes)
I know that calling cl.Buffer every time I need to call the CL kernel is
quite inefficient, but that was the only way I found to get the right
result.
So, my questions are:
i) How can I avoid this problem without a complete refactoring of my code
(I don't want to lose too much of its flexibility)?
ii) Is it possible to allocate the maximum size [which is O(N)] of the
device memory at the beginning of the simulation and still be able to call
the CL kernel with different numbers of particles N_L (N_L <= N)? If so,
what do I need to do to be able to call the kernel with, for example,
N_L = N/2 or N/4 or whatever? And what about the 'result' array? Will I
always get an O(N) 'result' array, or will it be O(N_L)?
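For reference, this is roughly the preallocation scheme I have in mind for (ii), host-side only; the pyopencl calls are left as comments because I am not sure of the right way to do them (that is exactly my question), and all buffer names are hypothetical:

```python
import numpy as np

N = 1024  # maximum number of particles, fixed at simulation start

# Allocate once for the maximum size, at the beginning of the simulation:
# acc_buf = cl.Buffer(ctx, mf.READ_WRITE, size=N * 3 * np.float64().nbytes)
result_host = np.empty((N, 3), dtype=np.float64)  # max-size host array

def run_step(pos, n_L):
    """Compute accelerations for the first n_L particles only.

    Assumes the active particles have already been gathered into the
    leading n_L rows of 'pos'.
    """
    # Copy just the active slice into the preallocated device buffer:
    # cl.enqueue_copy(queue, pos_buf, pos[:n_L])
    # Launch the kernel over n_L work-items, passing n_L as an argument:
    # acc_kernel(queue, (n_L,), None, pos_buf, acc_buf, np.int32(n_L))
    # Read back only the n_L rows that were actually computed:
    # cl.enqueue_copy(queue, result_host[:n_L], acc_buf)
    return result_host[:n_L]  # O(N_L) view into the O(N) host array
```

If something along these lines is possible, it would answer the 'result' question too: the host array stays O(N), but each call only touches (and returns a view of) O(N_L) of it.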
thanks in advance!
and apologies for my English...
Guilherme