I realize that this list is primarily for GPU computing, PyOpenCL being a
descendant of PyCUDA, and I am aware of Vasiliy's work on optimizing matrix
operations and know that he is very well respected in the GPGPU community.
I have done some GPU programming in the past, but in my current
work environment we are using traditional (funny how Beowulf machines are
now considered traditional) clusters with MPI and little to no shared-memory
programming. I'm trying to change this by arguing that we should
leverage heterogeneous environments for our codes.
Right now I am primarily concerned with making the argument that OpenCL is a
viable option for numerical computing on shared-memory multicore computers,
but the results I am seeing do not support this. This may be (and
probably is) due to my inadequacy as a programmer, or it may be that the current
implementations of OpenCL for the CPU are not utilizing the full resources
of the computer. Perhaps there is a better forum for discussing OpenCL on
the CPU, but it is still an immature language and I thought I might find
some insight on this list.
I know that 'top' is not a way to measure the efficiency of a numerical
algorithm, but my understanding is that it can measure the
occupancy of the CPU. It was troubling to me to see that the Python process
was utilizing nearly 800% of the processors (on my 8 core Xeon) whereas the
PyOpenCL process was only utilizing 400-500%, and no other significant CPU
programs were running concurrently.
I used the same program semantics on both the CPU and GPU. It could be (but
I doubt it) that the poor performance is a result of creating memory
buffers rather than directly using host-allocated memory. However, the
copies are only performed once every several hundred iterations and
shouldn't be the bottleneck of the program.
I have only looked at domains of up to 1000 x 1000. For smaller domains,
up to 500 x 500, my OpenCL program is faster than a sequential Cython
implementation, probably because most of the matrix fits in
cache. At larger sizes, though, the OpenCL runtimes explode.
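
A rough way to sanity-check the cache explanation is to compare the memory
footprint of the domain with the cache size. The sketch below is only
arithmetic under stated assumptions: double-precision values, two arrays
(for example an old and a new grid), and a hypothetical 8 MiB last-level
cache. None of these figures come from the actual program or machine.

```python
def grid_bytes(nx, ny, itemsize=8, n_arrays=1):
    """Bytes needed to hold n_arrays grids of nx x ny values."""
    return nx * ny * itemsize * n_arrays


# Assumption: 8 MiB last-level cache. Substitute the real figure for the CPU.
ASSUMED_CACHE_BYTES = 8 * 1024 * 1024

for n in (500, 1000):
    # Assumption: two double-precision arrays are live at once.
    size = grid_bytes(n, n, itemsize=8, n_arrays=2)
    fits = size <= ASSUMED_CACHE_BYTES
    print(f"{n} x {n}: {size / 2**20:.2f} MiB, fits in assumed cache: {fits}")
```

Under these assumptions the 500 x 500 case needs about 3.8 MiB and the
1000 x 1000 case about 15.3 MiB, so only the smaller one would fit.
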
I've read two books on OpenCL (Gaster's and Munshi's) and at least one
states that for the CPU, you should let the OpenCL driver choose the
appropriate work group size. However, I am wondering whether, if I am getting
many cache misses, it would be beneficial to break them up myself.
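
For reference, the two options differ only in the local size passed at
kernel launch. In PyOpenCL a kernel call takes (queue, global_size,
local_size, ...), and passing None as local_size leaves the choice to the
implementation. With an explicit work group size, OpenCL 1.x requires each
global dimension to be a multiple of the corresponding local dimension, so
the global size has to be padded. A minimal sketch of that padding follows;
the helper name and the 16 x 16 tile are illustrative assumptions, not tuned
values.

```python
def padded_global_size(shape, local_size):
    """Round each dimension of shape up to a multiple of local_size."""
    return tuple(((n + t - 1) // t) * t for n, t in zip(shape, local_size))


domain = (1000, 1000)
local_size = (16, 16)  # assumption: an arbitrary tile, not a recommendation
global_size = padded_global_size(domain, local_size)

# With PyOpenCL the two variants would look roughly like this, where
# `prg.step` is a hypothetical kernel name:
#   prg.step(queue, domain, None, ...)              # driver chooses
#   prg.step(queue, global_size, local_size, ...)   # explicit tiling
print(global_size)
```

Because the padded global size can exceed the domain, the kernel would need
a bounds check so that the extra work items do nothing.
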
Even for small domains, though, where nearly everything should fit into
cache, my program is far slower than an OpenMP program.
Anyway, you will have to pardon any stupidity or unsophistication on my
part, as I am from Alabama :)
Robert L Cloud
Student, School of Engineering
The University of Alabama at Birmingham