On Tue, 20 Sep 2011 17:19:50 -0500, Robert L Cloud <rcloud(a)uab.edu> wrote:
> However, even for small domains, where almost everything should fit in
> cache, my program is far slower than an OpenMP program.
Just one more suggestion from my side: try to do more per work item. It
might be that the AMD implementation has a fairly high setup cost for
each work item, so having fewer (larger) ones should be beneficial. In
my experience, the AMD implementation gives performance about as good as
gcc, while Intel's can be significantly better, depending on what you're
trying to do.
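
A minimal sketch of the "more per work item" idea, written as plain C so
it runs without an OpenCL runtime. Everything here is an assumption for
illustration: the saxpy-style operation, the function names, and the
`chunk` parameter are hypothetical and not taken from the original
program. The `gid` argument stands in for what `get_global_id(0)` would
return inside a kernel, and the host loops stand in for the NDRange
launch.

```c
#include <stddef.h>

/* Fine-grained body: one element per work item.
   In an OpenCL C kernel, gid would come from get_global_id(0). */
void saxpy_fine(size_t gid, float a, const float *x, float *y)
{
    y[gid] = a * x[gid] + y[gid];
}

/* Coarsened body: each work item handles `chunk` consecutive elements,
   so any per-work-item setup cost is paid once per chunk instead of
   once per element. The bound check covers a final partial chunk. */
void saxpy_coarse(size_t gid, size_t chunk, size_t n,
                  float a, const float *x, float *y)
{
    size_t begin = gid * chunk;
    size_t end = begin + chunk;
    if (end > n)
        end = n;
    for (size_t i = begin; i < end; ++i)
        y[i] = a * x[i] + y[i];
}

/* Number of work items needed to cover n elements (rounded up). */
size_t coarse_global_size(size_t n, size_t chunk)
{
    return (n + chunk - 1) / chunk;
}

/* Host-side stand-in for launching n work items. */
void run_fine(size_t n, float a, const float *x, float *y)
{
    for (size_t gid = 0; gid < n; ++gid)
        saxpy_fine(gid, a, x, y);
}

/* Host-side stand-in for launching ceil(n / chunk) work items. */
void run_coarse(size_t n, size_t chunk, float a, const float *x, float *y)
{
    size_t global = coarse_global_size(n, chunk);
    for (size_t gid = 0; gid < global; ++gid)
        saxpy_coarse(gid, chunk, n, a, x, y);
}
```

Both variants compute the same result; only the number of work items
changes. Which `chunk` value pays off (if any) depends on the
implementation, so it is worth timing a few sizes rather than assuming
one.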