Two more quick points...
If I let the code keep running on the ION2 system I get this:
And... if I set the environment variable to show compiler output on the ION2 system, I get this:
I'm struggling to interpret what that all means. ;-)
Any hints appreciated.
BTW... is there a 'release' memory method needed when using pyopencl?
Do I need to create my context/queue only once and pass it around to be reused all the time?
On Jan 27, 2012, at 6:24 AM, Steve Spicklemire <steve(a)spvi.com> wrote:
More on this saga. ;-)
Short story.. I *think* I'm having memory management trouble... but I'm not sure
how, or how to track it down.
I've changed my code a fair amount after getting a bit more educated WRT GPU programming.
I've got two systems I'm testing on, my laptop (15" macbook pro, NVIDIA
GeForce GT 330M 512 MB) and a baby cluster I've built using BCCD (6x debian intel atom
itx boards with ION2 graphics built-in).
The laptop is more portable. ;-)
I decided to try to use ranluxcl directly inside a custom kernel rather than the cl.rand
module (though I read that module's source and used it as an example of how to call ranluxcl).
I'm still using the ReductionKernel class to get the final result.
Here's the code I'm running on the mac:
And here are the results....
It runs to completion... but notice that the 'random' numbers aren't behaving
randomly! I thought the period of ranlux was very large.. so I'm puzzled.
Next... when I run this code:
on one of the cluster nodes.. I get this:
Wacky! Same code (more or less... just startup is different).
If I let it keep running it will eventually say "Host memory exhausted" or
some such. By "host" I'm assuming it means the CPU, not the GPU, right? Very
little host memory is involved, I think... it's almost entirely on the GPU. But anyway,
shouldn't that memory get freed when the function exits and the local python variables go out
of scope? Mysterious!
I'm pretty sure I'm still missing some basic rule/concept about pyopencl... any hints appreciated!
On Jan 20, 2012, at 10:55 AM, Andreas Kloeckner wrote:
>> I guess I was hoping for a significant speedup going to a GPU
>> approach. (note I'm naturally uninterested in the actual value of pi!
>> I'm just trying to figure out how to get results out of a GPU. I'm
>> building a small cluster with 6 baby GPUs and I'd like to get smart
>> about making use of the resource)
>> I'm also a little worried about the warning I'm getting about
>> query SIMD group size". Looking at the source it appears the platform
>> is returning "Apple" as a vendor, and that case is not treated in the
>> code that checks.. so it just returns None. When I run
>> 'dump_properties' I see that the max group size is pretty big!
>> Anyway.. I'll try your idea of using enqueue_marker to try to track
>> down what's really taking the time. (I guess 60% of it *was*
>> generating excessively luxurious random numbers!) But I still feel I
>> should be able to beat the CPU by quite a lot.
> export COMPUTE_PROFILE=1
> and rerun your code. The driver will have written a profiler log file
> that breaks down what's using time on the GPU. (This might not be true
> on Apple CL if you're on a MacBook, not sure if that provides an
> equivalent facility. If you find out, please report back to the list.)
> Next, take into account a GT330M lags by a factor of ~6-7 compared to a
> 'real' discrete GPU, first in memory bandwidth (GT330M: 25 GB/s, good
> discrete chip: ~180 GB/s), and, less critically, in processing
> power. Also consider that your CPU can probably get to ~10 GB/s mem
> bandwidth if used well.