On Mon, 14 Feb 2011 21:03:55 +0100, Tomasz Rybak <bogomips(a)post.pl> wrote:
On Sunday, 2011-02-13, at 19:12 -0500, Andreas
On Mon, 14 Feb 2011 00:51:13 +0100, Tomasz Rybak
After discussion with Martin Laprise I have come up
with the following code
(see attachment). It uses all available MPs, but I think it needs
some code to decide whether to use the entire GPU (in case the generated
vector is long) or only a few blocks (otherwise).
I can fix the attached code to better suit PyCUDA style so you can push
it to git, and only then try to add code managing the number of used blocks.
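The heuristic proposed above could look something like the following sketch. This is only an illustration of the idea, not the actual PyCUDA code; the function name, the threads-per-block value, and the rounding scheme are all made up for the example.

```python
# Hypothetical sketch of the block-count heuristic discussed above:
# use every multiprocessor for long vectors, but cap the block count
# for short ones so we do not launch mostly-idle blocks.
# THREADS_PER_BLOCK is an illustration value, not PyCUDA's actual one.

THREADS_PER_BLOCK = 256

def pick_block_count(vector_length, mp_count):
    """Return how many blocks to launch for `vector_length` samples."""
    # Enough blocks to give every thread at least one sample...
    needed = (vector_length + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    # ...but never more than one block per multiprocessor.
    return max(1, min(needed, mp_count))
```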
Please work your changes into the branch I created. The changes there
concerned (much) more than style.
I have noticed them - I like your solution.
BTW - you misspelled the names of the float2 and double2 CURAND functions;
I have fixed them in the attached patch.
Also, those functions (float2, double2) are available for the XORWOW
generator, not for Sobol32 - unless I misunderstood the purpose of the variable
Whoops, looks like you're right.
Sobol' direction vectors need to come from a very specific set to
make sense; see curandGetDirectionVectors32 in the CURAND docs. We
should probably call/wrap this function to get those vectors. Further,
each generator should use a different vector, rather than the same one.
- The Sobol' initialization needs to be worked out. In particular, I
would like both generators to do something sensible if they're
initialized without arguments.
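To make the per-generator point above concrete: curandGetDirectionVectors32 hands back an array of 32-word direction-vector sets (curandDirectionVectors32_t is unsigned int[32]), and the idea is that generator i should receive set i rather than every generator sharing set 0. The sketch below only illustrates that slicing; the dummy list stands in for the real CURAND table, and the function name is invented for the example.

```python
# Illustrative sketch only: generator i gets direction-vector set i.
# `direction_vectors` is a flat dummy stand-in for the table that
# curandGetDirectionVectors32 would return; it is NOT real CURAND data.

WORDS_PER_SET = 32  # curandDirectionVectors32_t is unsigned int[32]

def vectors_for_generator(direction_vectors, generator_index):
    """Slice out the direction-vector set for one generator."""
    start = generator_index * WORDS_PER_SET
    return direction_vectors[start:start + WORDS_PER_SET]
```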
Agree on both points.
Ok, sounds good.
See attached patch.
I have added a field self.block_count that is equal to the number of MPs,
and it is the number of blocks that are run when generating random numbers.
Should I try to play with it and use fewer blocks for shorter sequences,
or just leave it as is? I would prefer leaving it as is ;-) ; for
smaller generated sequences the kernels execute quickly, so the potential
performance gains might not be worth the sophisticated code.
I'd even go the opposite way and bump this to 2-3 times the number of
SMs. Optimizing for large arrays is fair, I think. If someone needs 15
random numbers, they're hardly going to come running to the GPU for them.
Done in the revised version of your patch.
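A minimal sketch of the agreed-on default, assuming the multiplier of 2 from the suggestion above. The function name is invented; in real PyCUDA code the MP count would come from `device.get_attribute(pycuda.driver.device_attribute.MULTIPROCESSOR_COUNT)`, which is kept out of the sketch so it runs without a GPU.

```python
# Sketch: block_count as a small multiple of the SM count, per the
# discussion above.  mp_count would come from PyCUDA's
# device_attribute.MULTIPROCESSOR_COUNT query on real hardware.

def default_block_count(mp_count, multiplier=2):
    """Blocks to launch: a small multiple of the SM count."""
    return multiplier * mp_count
```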
I have managed to use the maximum number of threads on the
Tesla - during
initialisation I am just launching 2*blocks blocks, each initialising
only half of the generators that are used by one block. The test case
worked on an ION, and the sample program worked on Martin Laprise's machine,
so I believe this is a good solution.
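The initialisation trick described above can be sketched as follows: instead of one launch of `blocks` blocks with `generators_per_block` threads each (which can exceed the per-block thread limit), launch twice as many blocks, each covering half of one block's generators. The function and parameter names are illustrative, not the actual PyCUDA ones.

```python
# Sketch of the two-half initialisation launch described above.
# Doubling the grid while halving the threads per block keeps the
# total number of initialised generators the same while staying
# under the per-block thread limit.

def init_launch_config(blocks, generators_per_block):
    """Return (grid size, threads per block) for initialisation."""
    assert generators_per_block % 2 == 0
    grid = 2 * blocks                     # twice as many blocks...
    threads = generators_per_block // 2   # ...each doing half the work
    return grid, threads
```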
Sounds ok. On my Tesla, the double2 kernel runs into trouble for lack of
registers. I've thus bumped generators_per_block down by a factor
of 2 from the maximum.
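The register-pressure workaround above amounts to this small sketch, assuming max_threads would come from the device's MAX_THREADS_PER_BLOCK attribute in real code; the function name and the boolean flag are invented for illustration.

```python
# Sketch: halve generators_per_block when a kernel (here, double2)
# cannot launch at the maximum thread count for lack of registers.
# The factor of 2 is the one chosen in the message above.

def generators_per_block(max_threads, register_limited=False):
    """Generators per block, halved under register pressure."""
    return max_threads // 2 if register_limited else max_threads
```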
After you pull the changes into git I will start working
Pulled with a few changes as detailed above, still on