What domain sizes are you studying in this problem? A 4-point stencil is
memory bound, so you shouldn't expect to outperform the STREAM benchmark
(scaled by the appropriate reuse ratio, which depends on how far you
unroll the kernel).
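To make that bound concrete, here is a minimal sketch of the arithmetic,
assuming single-precision data and treating the STREAM figure as the
sustained bandwidth. The function name, the 20 GB/s value, and the
loads-per-point figures are placeholders for illustration, not measurements
from this thread.

```python
def stencil_upper_bound(stream_gb_s, itemsize=4, loads_per_point=1.0,
                        stores_per_point=1.0):
    """Upper bound on stencil updates per second for a memory-bound kernel.

    loads_per_point is the reuse-dependent number of values fetched from
    main memory per output point: 1.0 means every neighbour comes from
    cache or registers, 4.0 means no reuse at all for a 4-point stencil.
    """
    bytes_per_point = itemsize * (loads_per_point + stores_per_point)
    return stream_gb_s * 1e9 / bytes_per_point


# Hypothetical STREAM result of 20 GB/s (replace with your own measurement).
best = stencil_upper_bound(20.0, loads_per_point=1.0)   # perfect reuse
worst = stencil_upper_bound(20.0, loads_per_point=4.0)  # no reuse
print("perfect reuse: %.2e points/s" % best)
print("no reuse:      %.2e points/s" % worst)
```

If your measured update rate is already close to the perfect-reuse figure,
there is probably little left to gain from tuning the kernel itself.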
Have you looked at Volkov's work on this problem? They have a very good
CUDA implementation for 3-D stencil operators, and a lot of what they say
applies in 2-D: http://www.cs.berkeley.edu/~volkov/volkov10-parcfd.pdf
Also, you can't use 'top' as a reliable measure of computational performance
when analyzing numerical code. You need to work out how many floating-point
operations per cycle (or how much memory bandwidth) your CPU or GPU can
deliver, and compare that with the requirements of your operator.
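A rough way to do that comparison is a roofline-style estimate. The sketch
below assumes an operator that does 4 flops per point (for example 3 adds and
1 multiply) and moves 8 bytes per point (one float32 load and one store with
perfect reuse). The peak numbers are made-up placeholders; substitute the
figures for your own hardware.

```python
def attainable_gflops(flops_per_point, bytes_per_point, peak_gflops, peak_gb_s):
    """Roofline-style estimate of the best achievable GFLOP/s.

    The operator needs flops_per_point / bytes_per_point flops for every
    byte it moves; the hardware can feed at most peak_gb_s GB/s, so the
    result is capped by whichever limit is hit first.
    """
    intensity = flops_per_point / bytes_per_point  # flops per byte
    return min(peak_gflops, intensity * peak_gb_s)


# Hypothetical machine: 100 GFLOP/s peak, 20 GB/s sustained bandwidth.
peak_gflops, peak_gb_s = 100.0, 20.0
bound = attainable_gflops(4.0, 8.0, peak_gflops, peak_gb_s)
limit = "memory" if bound < peak_gflops else "compute"
print("attainable: %.1f GFLOP/s (%s bound)" % (bound, limit))
```

Timing the kernel and dividing the work done by the elapsed time gives a
number you can hold against this bound, which 'top' cannot tell you.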
On Tue, Sep 20, 2011 at 10:21 PM, Robert L Cloud <rcloud(a)gmail.com> wrote:
I've done some analysis comparing CPU (on a Nehalem) and GPU (on a Tesla)
performance of PyOpenCL to parallel Cython using OpenMP. The performance of
PyOpenCL on the CPU (Intel Nehalem with AMD OpenCL 1.1) was very poor, even
slower than a single-threaded Cython program. I realize that my OpenCL
implementation was fairly poor, but I expected performance to be a bit
better than it was.
The analysis is available here:
I'm hoping that someone can give some insight into how to improve it or why
it is so bad.
Also, I would like to run the analysis again with the Intel OpenCL driver,
but can't get PyOpenCL to recognize both Intel and AMD platforms; when I run
get_platforms it only shows AMD. Here is my siteconf.py file:
rcloud@Vertex:~/sources/pyopencl-2011.1.2$ cat siteconf.py
BOOST_INC_DIR = 
BOOST_LIB_DIR = 
BOOST_COMPILER = 'gcc43'
BOOST_PYTHON_LIBNAME = ['boost_python-gcc43-mt']
USE_SHIPPED_BOOST = True
CL_TRACE = False
CL_ENABLE_GL = False
CL_ENABLE_DEVICE_FISSION = True
CL_LIBNAME = ['OpenCL']
CXXFLAGS = 
LDFLAGS = 
thanks in advance,
Robert L Cloud
"Why do you want to distance yourself from all of us
and from our opinion?"
I do not write to please you;
you are meant to learn something.
PyOpenCL mailing list