[PyCUDA] CURAND wrappers - Long pause after generation of the a random matrix
Martin Laprise
mlaprise.ml at gmail.com
Thu Jan 20 06:37:36 PST 2011
Thanks for your explanation. I don't think your wrapper is too simple. I
think it's a very good start and It saved me a lot of work. I was just
surprised by the performance of CURAND itself which advertised a speedup
around 10-20X. But, I think with your below code running on a Fermi, we are
quite near the advertised speedup. I ran the code on my GTX 260. I was not
able to use N = 100000000 (cuLaunchGrid failed: out of memory) but with N =
50000000 I've got:
Bench 1: 1.75797915459 sec
Bench 2: 0.303898096085 sec
impressive !
On Thu, Jan 20, 2011 at 8:19 AM, Tomasz Rybak <bogomips at post.pl> wrote:
> Dnia 2011-01-18, wto o godzinie 15:45 -0500, Martin Laprise pisze:
> > Thank you for your help. In fact, it was one of my initial guess that
> some
> > stuff happen asynchronously. I saw the same result by using "del rr". But
> I
> > was septic because it make the performance not that impressive. Here is a
> > comparison with numpy.random.randn() on a Q6600 @ 3.2 Ghz:
> >
> > Bench GPU 2: 4.41615891457 sec
> > Bench CPU: 7.45132184029 sec
> >
> > Did I miss something ?
>
> No.
> First - I think NVIDIA introduced CURAND to avoid generating random
> values and transferring them over to GPU - which would give you some
> performance benefits.
> Second - you already have performance gain.
> Third - my wrappers around CURAND are rather simple (maybe too simple);
> because they use only one multiprocessor. I just decided to use one MP
> because of troubles with too many threads during initialisation
> on Tesla.
> I have come with below code (warning - this is for now rather dirty
> hack, but works for me on both GTX 460 Fermi, and ION 9400 Tesla):
>
> mp_source = """
> // Uses C++ features (templates); do not surround with extern C
> #include <curand_kernel.h>
>
> extern "C" {
> __global__ void prepare_with_seed(curandState *s, const int n, const int
> seed, const int offset) {
> int id = blockIdx.x*blockDim.x+threadIdx.x;
> if (id < n) {
> curand_init(seed, threadIdx.x, offset, &s[id]);
> }
> }
> __global__ void normal_float(%(data_type)s *s, float *d, const int n) {
> int id = blockIdx.x*blockDim.x+threadIdx.x;
> const int tidx = threadIdx.x;
> for (int idx = id; idx < n; idx += blockDim.x*gridDim.x) {
> d[idx] = curand_normal(&s[id]);
> }
> }
> }
> """
>
> def testMP(data, input_size, stream = None):
> import pycuda.compiler
> import pycuda.driver
> import random
> if pycuda.driver.get_version() < (3, 2, 0):
> raise EnvironmentError("Need at least CUDA 3.2")
> dev = pycuda.driver.Context.get_device()
> block_size =
> dev.get_attribute(pycuda.driver.device_attribute.MAX_THREADS_PER_BLOCK)
> block_dimension =
> dev.get_attribute(pycuda.driver.device_attribute.MAX_BLOCK_DIM_X)
> generator_count = min(block_size, block_dimension)
> mp_count =
> dev.get_attribute(pycuda.driver.device_attribute.MULTIPROCESSOR_COUNT)
> state = pycuda.driver.mem_alloc(generator_count *
> curand_state_size)
> seed = random.randint(0, ((1 << 31) - 1))
> source = str(mp_source) % {
> 'data_type': 'curandState',
> }
> module = pycuda.compiler.SourceModule(source, no_extern_c=True,
> keep=True)
> p = module.get_function("prepare_with_seed")
> if dev.compute_capability() >= (2, 0):
> p.prepare("Piii", block=(generator_count, 1, 1))
> p.prepared_call((mp_count, 1), state, generator_count, seed,
> 0)
> else:
> # Ugly hack for non-Fermi
> p.prepare("Piii", block=(generator_count/2, 1, 1))
> p.prepared_call((mp_count*2, 1), state, generator_count,
> seed, 0)
> normal_float = module.get_function("normal_float")
> normal_float.prepare("PPi", block=(generator_count, 1, 1))
> normal_float.prepared_async_call((mp_count, 1), stream, state,
> data, input_size)
>
> if __name__ == '__main__':
> import pycuda.gpuarray
> import pycuda.autoinit
> import time as clock
> import numpy
> N = 100000000
> rr = PseudoRandomNumberGenerator(0,
> numpy.random.random(256).astype(numpy.int32))
> cuda_stream = pycuda.driver.Stream()
>
> t1 = clock.time()
> # GPU
> data = pycuda.gpuarray.empty([N], numpy.float32)
> rr.fill_normal_float(data.gpudata, N, stream=cuda_stream)
> cuda_stream.synchronize()
> t2 = clock.time()
> print "Bench 1: " + str(t2-t1) + " sec"
> del data
> t1 = clock.time()
> # GPU
> data = pycuda.gpuarray.empty([N], numpy.float32)
> testMP(data.gpudata, N, stream=cuda_stream)
> cuda_stream.synchronize()
> t2 = clock.time()
> print "Bench 2: " + str(t2-t1) + " sec"
>
>
> It uses all MPs it can find and gives following results:
> ION (had to descrease N to 90M):
> Bench 1: 3.946
> Bench 2: 3.911
> not much improvement, but ION has only 2 MPs.
>
> GTX 490 (N = 100M):
> Bench 1: 1.13427710533 sec
> Bench 2: 0.22767996788 sec
> So here you have some improvement, although only 5x, not 7x as would
> be suggested by 7MPs on GTX 460. Please test on your GTX 280,
> which should give results between ION and GTX 460.
>
> But I would like for Andreas to include first version
> of CURAND wrappers, and only then start working on optimisations.
>
>
>
> >
> > Martin
> >
> > On Tue, Jan 18, 2011 at 2:56 PM, Tomasz Rybak <bogomips at post.pl> wrote:
> >
> > > Dnia 2011-01-18, wto o godzinie 09:46 -0500, Martin Laprise pisze:
> > > > Hi, I just made some experiments with the CURAND wrappers. It seem to
> > > work
> > > > very nicely except for a little detail that I can't figure out. The
> > > > initialization of the generator and the actual random number
> generation
> > > seem
> > > > very fast. But for what ever reason, PyCUDA take a long time to
> "recover"
> > > > after the number generation. This pause is significantly longer than
> the
> > > > actual computation and the delay increase with N. Here is an example:
> > > >
> > >
> > > curand kernels are called asynchronously.
> > > This means that PyCUDA returns immediately after
> > > initiating the call, and does not wait for result.
> > > This allows hardware or drive to better manage
> > > order of execution, and to run many kernels concurrently
> > > on modern hardware (2.x capabilities).
> > >
> > > After changing your code to force PyCUDA to wait I got
> > > following results:
> > >
> > > import numpy as np
> > > import pycuda.autoinit
> > > import pycuda.gpuarray
> > > from pycuda.curandom import PseudoRandomNumberGenerator,
> > > QuasiRandomNumberGenerator
> > > import cProfile
> > > import time as clock
> > >
> > >
> > > cuda_stream = pycuda.driver.Stream()
> > >
> > > def curand_prof():
> > >
> > > N = 100000000
> > >
> > > t1 = clock.time()
> > > # GPU
> > > rr = PseudoRandomNumberGenerator(0,
> > > np.random.random(128).astype(np.int32))
> > > data = pycuda.gpuarray.empty([N], np.float32)
> > > rr.fill_normal_float(data.gpudata, N, stream=cuda_stream)
> > > cuda_stream.synchronize()
> > > t2 = clock.time()
> > > print "Bench 1: " + str(t2-t1) + " sec"
> > >
> > >
> > > if __name__ == "__main__":
> > > t4 = clock.time()
> > > curand_prof()
> > > t5 = clock.time()
> > > print "Bench 2: " + str(t5-t4) + " sec"
> > >
> > > Bench 1: 1.15405488014 sec
> > > Bench 2: 1.15947508812 sec
> > >
> > > It seems consistent with your results - I was running on GTX 460
> > > with Fermi. Your GTX 260 is Tesla, so 256 threads are used;
> > > Fermi uses 1024 threads, which uses 4 times less time to compute
> > > random numbers.
> > >
> > > Best regards, thanks for noticing this, and thanks for testing
> > > CURAND wrapper.
> > >
> > > --
> > > Tomasz Rybak <bogomips at post.pl> GPG/PGP key ID: 2AD5 9860
> > > Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
> > > http://member.acm.org/~tomaszrybak
> > >
>
>
> --
> Tomasz Rybak <bogomips at post.pl> GPG/PGP key ID: 2AD5 9860
> Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
> http://member.acm.org/~tomaszrybak
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.tiker.net/pipermail/pycuda/attachments/20110120/3ccdda2c/attachment-0001.html>
More information about the PyCUDA
mailing list