Thank you very much for your reply, Andreas.
> As you mention, this is very likely a bug in your code. I
> don't have time to go in and debug it for you, but I've got a few hints
> that might help:
> - Try and use valgrind on your CPU code. Unfortunately, that spews a lot
> of unrelated nonsense for both the Intel and AMD OpenCL runtimes, but
> it has helped me a few times.
OK, I tried that, but I have not been able to find any
useful information yet.
> - Have you tried with both Intel's and AMD's CPU runtime?
Yes, and on both Intel's and AMD's runtime the kernel runs as expected.
> - AMD's CPU runtime allows you to use gdb (in 32 bit). See their
> documentation for how to do that.
I have a 64-bit OS, and gdb does not seem to work with
PyOpenCL.
Because of that, we tested the whole kernel code in C and debugged it there.
Everything seems correct (see attachment), even for very large data sets.
So the code works fine when executed instance by instance,
but it crashes when executed in parallel on an nVidia GPU.
Is there a chance that the code could work with other work group sizes?
I could imagine a resource allocation problem when many kernel
instances are created at once.
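
To try other work group sizes systematically, here is a minimal sketch of
how the launch sizes could be enumerated. It is plain Python and does not
touch the device. The function names and the numbers (1000 work items, a
limit of 64) are hypothetical; the real limit would have to be queried for
the actual kernel and device, e.g. with
kernel.get_work_group_info(cl.kernel_work_group_info.WORK_GROUP_SIZE, device).

```python
def round_up(n, multiple):
    """Return the smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def candidate_launches(n_items, max_work_group_size):
    """Yield (global_size, local_size) pairs for power-of-two local sizes.

    The global size is padded so that it is divisible by the local size.
    """
    local = 1
    while local <= max_work_group_size:
        yield round_up(n_items, local), local
        local *= 2


# Hypothetical numbers: 1000 work items, at most 64 items per work group.
for global_size, local_size in candidate_launches(1000, 64):
    print(global_size, local_size)
```

Each pair could then be passed as the global and local size of the kernel
call. Because the global size is padded, the kernel would need a guard such
as "if (get_global_id(0) >= n) return;" so that the extra work items do not
access memory out of bounds. Passing None as the local size lets the runtime
pick a work group size itself, which might also be worth a try.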