On Sat, 28 Jan 2012 18:21:29 -0500, Thomas Wiecki <Thomas_Wiecki(a)brown.edu> wrote:
> I am currently revisiting this but having some problems with the
> random number generator.
>
> generator.generators_per_block is 512 on my card, so I initialize 512
> generators, but I see that some of them don't produce random numbers
> when sampling from them. I notice this in subtle ways (mainly that
> the distribution is not correct, or that all numbers are the same)
> when I sample from more than 300-350 generators (the last ones are
> always affected), but it's fine when using e.g. 128. So it seems I
> can only use a smaller number of generators than what the card says
> I should be able to use.
>
> Any idea on why that might be or how to investigate this further?
Mysterious. What generator is this using? XORWOW? Tomasz, any ideas?
Andreas
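One way to start probing Thomas's question, as a sketch (it assumes the XORWOW generator from pycuda.curandom and its fill_uniform method; the chunk-to-generator mapping below is only a heuristic, since the element-to-generator layout is an internal detail):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda import curandom

gen = curandom.XORWOWRandomNumberGenerator()
n_gen = gen.generators_per_block
samples_per_gen = 10000

data = gpuarray.empty(n_gen * samples_per_gen, dtype=np.float32)
gen.fill_uniform(data)
host = data.get().reshape(n_gen, samples_per_gen)

# Uniform(0,1) should give mean ~0.5 and std ~0.289 in every chunk;
# chunks that are constant or far off would point at broken generators.
means = host.mean(axis=1)
stds = host.std(axis=1)
suspect = np.where((np.abs(means - 0.5) > 0.05) | (stds < 0.1))[0]
print("suspect chunks: %s" % suspect)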
Great, thanks!
Andreas
On Tue, 17 Jan 2012 08:07:19 +0100, Thomas Wiecki <Thomas_Wiecki(a)brown.edu> wrote:
> Not following that list, but I hope it got resolved.
>
> This works fine now for me. Danke!
>
> Thomas
>
> On Tue, Jan 17, 2012 at 2:31 AM, Andreas Kloeckner
> <lists(a)informa.tiker.net> wrote:
> > On Sat, 14 Jan 2012 12:45:12 +0100, Thomas Wiecki <Thomas_Wiecki(a)brown.edu> wrote:
> >> On Fri, Jan 13, 2012 at 10:13 PM, Andreas Kloeckner
> >> <lists(a)informa.tiker.net> wrote:
> >> > You mean uintp (not uintp32), right? I've made that fix in compyte. Can
> >> > you please verify? (requires a submodule update, fixed in both PyOpenCL
> >> > and PyCUDA)
> >>
> >> Yes, that's a typo.
> >>
> >> > I was a bit unsure what C type to map this to, but decided in favor of
> >> > uintptr_t, even though that requires the user to have stdint.h included,
> >> > which none of the other types do. Hope that's ok, but I am open to
> >> > suggestions.
> >>
> >> The current fix doesn't work for me:
> >> CompileError: nvcc compilation of /tmp/tmp2ru5rp/kernel.cu failed
> >> [command: nvcc --cubin -arch sm_11
> >> -I/usr/local/lib/python2.7/dist-packages/pycuda-2011.2.2-py2.7-linux-i686.egg/pycuda/../include/pycuda
> >> kernel.cu]
> >> [stderr:
> >> kernel.cu(7): error: identifier "uintptr_t" is undefined
> >
> > Can you try again now? Sorry for the wait. If you follow the PyOpenCL
> > list, you'll know what held me up. :(
> >
> > Andreas
>
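For reference, a minimal sketch of what the uintp mapping implies for kernel code, assuming the compyte fix maps numpy's uintp to uintptr_t as discussed above (the kernel and names here are illustrative, not from the thread):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# The kernel source has to pull in <stdint.h> itself, since uintptr_t is
# not predefined the way the other scalar types are.
mod = SourceModule("""
#include <stdint.h>

__global__ void take_uintp(uintptr_t x, unsigned int *out)
{
    *out = (unsigned int) x;  // echo the low bits back for a host-side check
}
""")

take_uintp = mod.get_function("take_uintp")
out_gpu = cuda.mem_alloc(4)
take_uintp(np.uintp(0xdeadbeef), out_gpu, block=(1, 1, 1), grid=(1, 1))

out = np.zeros(1, dtype=np.uint32)
cuda.memcpy_dtoh(out, out_gpu)
print(out[0] == 0xdeadbeef)  # expect True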
On Mon, 12 Dec 2011 21:37:38 -0500, Yifei Li <yifli82(a)gmail.com> wrote:
> Hi all,
>
> It is said in the CUDA 4.0 programming guide that "Blocks are organized
> into a one-dimensional, two-dimensional, or three-dimensional grid of
> thread blocks". Does PyCUDA currently support 3-dimensional grids?
Nope, this will happen once PyCUDA switches over to stateless launches
internally, as described in my previous email. CUDA's stateful launch
interface is 2D-grid-only, and PyCUDA uses that internally for greater
backwards compatibility.
Andreas
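Until then, a common workaround is to fold the third grid dimension into one of the two that the stateful interface does support. A sketch (illustrative names, not a PyCUDA feature):

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

gx, gy, gz = 4, 3, 2  # the logical 3D grid we actually want

mod = SourceModule("""
__global__ void logical_3d(int *out, int gx, int gy)
{
    int bx = blockIdx.x;
    int by = blockIdx.y % gy;  // recover the logical y index
    int bz = blockIdx.y / gy;  // recover the logical z index
    out[bx + gx * (by + gy * bz)] = bz;  // tag each block with its z
}
""")

func = mod.get_function("logical_3d")
out_gpu = cuda.mem_alloc(gx * gy * gz * 4)
func(out_gpu, np.int32(gx), np.int32(gy),
     block=(1, 1, 1), grid=(gx, gy * gz))  # z folded into the 2D grid's y

result = np.empty(gx * gy * gz, dtype=np.int32)
cuda.memcpy_dtoh(result, out_gpu)
print(result.reshape(gz, gy, gx))  # each z-slab holds its own z index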
On Wed, 14 Dec 2011 09:15:19 -0500, Thomas Wiecki <Thomas_Wiecki(a)brown.edu> wrote:
> This is getting very weird. I went into the function with pdb now.
> np.dtype('uint32') is in DTYPE_TO_NAME but for some reason it fails to
> look it up:
>
> KeyError: dtype('uint32')
> > /usr/local/lib/python2.7/dist-packages/pycuda-2011.2.2-py2.7-linux-i686.egg/pycuda/compyte/dtypes.py(104)dtype_to_ctype()
> 103 print np.dtype('uint32') in DTYPE_TO_NAME
> --> 104 print DTYPE_TO_NAME[dtype]
> 105 raise ValueError, "unable to map dtype '%s'" % dtype
>
> ipdb> dtype
> dtype('uint32')
> ipdb> np.dtype('uint32') == dtype
> True
> ipdb> DTYPE_TO_NAME[np.dtype('uint32')]
> 'unsigned'
> ipdb> DTYPE_TO_NAME[dtype]
> *** KeyError: dtype('uint32')
I'm as confused as you. Can you go up the call stack and see who made
that dtype, and how?
Andreas
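A dict lookup has to match on hash before it ever tests equality, so a dtype that compares equal to np.dtype('uint32') can still miss the DTYPE_TO_NAME entry if its hash differs. A speculative probe along those lines (stand-in table, not compyte's actual one):

import numpy as np

DTYPE_TO_NAME = {np.dtype('uint32'): 'unsigned'}  # stand-in for compyte's table
reference = np.dtype('uint32')

def probe(dtype):
    # If 'equal' is True but 'same hash' is False, the dict miss is explained.
    print("equal to reference: %s" % (dtype == reference))
    print("same hash:          %s" % (hash(dtype) == hash(reference)))
    print("found in table:     %s" % (dtype in DTYPE_TO_NAME))

probe(np.dtype('uint32'))

A possible stopgap while hunting the real cause would be to normalize before the lookup, e.g. DTYPE_TO_NAME[np.dtype(dtype.name)].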
On Sat, 28 Jan 2012 09:51:20 +1100, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hi Andreas,
>
> On Sat, Jan 28, 2012 at 3:23 AM, Andreas Kloeckner
> <lists(a)informa.tiker.net> wrote:
> > Indeed, inserting __syncthreads() after the
> > shared array declaration brings the error down to more reasonable values
> > for me. Jesse, my recommendation would be to use that as a workaround
> > while we figure out a more permanent fix.
>
> Can't we do this:
> >> 1. Using "extern __shared__ out_type sdata[]" and setting the size of
> >> shared memory when preparing the kernel.
> We can pass dtype instead of ctype to
> get_reduction_kernel_and_types(), and convert it to ctype + data size
> inside.
This issue has 'ick' written all over it. Unfortunately,
out_type sdata[]
doesn't appear to quite cut it, as Thrust uses this bit of code here:
http://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/ext…
which casts from int4 and apparently serves to ensure alignment. (Nathan
pointed me to this.) I'm not sure why the native type wouldn't quite be
correctly aligned, so I guess I'm not fully understanding...
Andreas
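My reading of the Thrust trick, as a sketch (an illustrative kernel, not PyCUDA's actual reduction template): back the shared array with raw int4 storage, which is a POD type, so it is 16-byte aligned and has no constructor to race with the explicit stores.

import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <pycuda-complex.hpp>

typedef pycuda::complex<float> out_type;

__global__ void reduce_sketch(out_type *in, out_type *out)
{
    // Raw POD storage: aligned, and no per-element default constructors.
    __shared__ int4 sdata_raw[(256 * sizeof(out_type) + sizeof(int4) - 1)
                              / sizeof(int4)];
    out_type *sdata = (out_type *) sdata_raw;

    unsigned tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] = sdata[tid] + sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}
""")
# Compiles for a 256-thread block; launching works as for any other kernel.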
I am currently revisiting this but having some problems with the
random number generator.
generator.generators_per_block is 512 on my card, so I initialize 512
generators, but I see that some of them don't produce random numbers
when sampling from them. I notice this in subtle ways (mainly that the
distribution is not correct, or that all numbers are the same) when I
sample from more than 300-350 generators (the last ones are always
affected), but it's fine when using e.g. 128. So it seems I can only
use a smaller number of generators than what the card says I should
be able to use.
Any idea on why that might be or how to investigate this further?
Thanks,
Thomas
On Thu, Jan 12, 2012 at 3:44 PM, Tomasz Rybak <tomasz.rybak(a)post.pl> wrote:
> On Sun, 2011-12-18 at 20:25 +0100, Thomas Wiecki wrote:
>> I think it just allocates the maximum number. Previously I wondered
>> how I could find this maximum number; it is stored in
>> generator.generators_per_block (=512 on my card).
>
> It generates that many for each multiprocessor.
> In line 413 of curandom.py, the code allocates generators_per_block
> times block_count generators. The block_count variable is set
> in line 349 from the number of multiprocessors.
>
> A more general note - do you see anything that could be added to the
> documentation or the classes? They were written with a simple use case
> in mind (generating an array of random numbers), so if you can propose
> other use cases, I think we can discuss them here and add that
> functionality.
>
> Best regards.
>
> --
> Tomasz Rybak GPG/PGP key ID: 2AD5 9860
> Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
> http://member.acm.org/~tomaszrybak
>
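For completeness, the arithmetic Tomasz describes, as a small sketch (the curandom line numbers refer to the 2011-era source; MULTIPROCESSOR_COUNT is a standard CUDA driver attribute):

import pycuda.autoinit
import pycuda.driver as cuda

dev = pycuda.autoinit.device
block_count = dev.get_attribute(cuda.device_attribute.MULTIPROCESSOR_COUNT)
generators_per_block = 512  # what Thomas reports for his card

total = block_count * generators_per_block
print("%d multiprocessors x %d generators/block = %d generators"
      % (block_count, generators_per_block, total))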
Hi Bogdan,
On Fri, 27 Jan 2012 13:27:04 +1100, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hello,
>
> As far as I understand, this is somehow connected with pycuda::complex
> having a default constructor. When you write (in a reduction kernel):
>
> __shared__ out_type sdata[BLOCK_SIZE];
>
> and "out_type" has the default constructor, it is called from _every_
> thread for _every_ value of the array. The reduction kernel starts
> like:
>
> __shared__ out_type sdata[BLOCK_SIZE];
> ... collecting values ...
> sdata[tid] = <some value>
>
> which means that there is a race between the default constructor and
> actual initialization. I am not sure why nvcc does not complain about
> this (bug?), but the following things do help:
> 1. Using "extern __shared__ out_type sdata[]" and set the size of
> shared memory when preparing the kernel.
> or
> 2. Putting "__syncthreads()" between default initialization and actual
> initialization (not very good, since it leaves all those calls to
> the default constructor, but still removes the symptom).
>
> I googled a bit, but could not find any actual rules about default
> constructors and shared memory.
Thanks for this insight! I don't think this would've occurred to me just
from staring at the code. :) Indeed, inserting __syncthreads() after the
shared array declaration brings the error down to more reasonable values
for me. Jesse, my recommendation would be to use that as a workaround
while we figure out a more permanent fix.
I just searched as well, but couldn't find anything. I've pinged Nathan
Bell at Nvidia (coauthor of thrust) to see what he thinks about this.
Andreas
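The workaround, spelled out as a sketch (an illustrative kernel, not the actual reduction template):

import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
#include <pycuda-complex.hpp>

typedef pycuda::complex<float> out_type;

__global__ void fill_and_read(out_type *out)
{
    __shared__ out_type sdata[256];  // nvcc emits per-element default
                                     // constructor calls for this array

    __syncthreads();  // the workaround: fence off those implicit calls

    unsigned tid = threadIdx.x;
    sdata[tid] = out_type(tid, 0.0f);
    __syncthreads();

    // Without the first barrier, this neighbor read could see a value
    // re-zeroed by a late-running constructor in another thread.
    out[tid] = sdata[(tid + 1) % 256];
}
""")
# Compiles for a 256-thread block; the first __syncthreads() is the fix.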
Hi Jesse,
On Wed, 25 Jan 2012 14:51:05 -0800, Jesse Lu <jesselu(a)stanford.edu> wrote:
> The attached script shows highly inconsistent results (> 10% error at
> times) between the numpy and gpuarray dot products. The inconsistent
> results seem to only appear for large gpuarrays of data type complex64 or
> complex128. Any ideas on what's going on? Thanks!
I can reproduce the issue, and I'll try to see what's behind
it. Unfortunately, I have a lot of stuff to do at the moment, so I can't
give you an ETA. Of course, I'd also appreciate any help in getting this
debugged--from anyone on the list! :)
Andreas
Hi,
The attached script shows highly inconsistent results (> 10% error at
times) between the numpy and gpuarray dot products. The inconsistent
results seem to only appear for large gpuarrays of data type complex64 or
complex128. Any ideas on what's going on? Thanks!
Jesse
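The attached script isn't included above; a minimal reproduction in its spirit might look like the following (sizes and names are my guesses):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

n = 2 ** 20
a = (np.random.randn(n) + 1j * np.random.randn(n)).astype(np.complex64)
b = (np.random.randn(n) + 1j * np.random.randn(n)).astype(np.complex64)

a_gpu = gpuarray.to_gpu(a)
b_gpu = gpuarray.to_gpu(b)

cpu = np.dot(a, b)
gpu = gpuarray.dot(a_gpu, b_gpu).get()

# float32 accumulation order explains small discrepancies, but nothing
# near the >10% error reported above.
rel_err = float(abs(gpu - cpu) / abs(cpu))
print("cpu=%s gpu=%s rel_err=%.3g" % (cpu, gpu, rel_err))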
On Tue, 24 Jan 2012 12:54:07 -0800, Massimo Becker <mbecker16(a)gmail.com> wrote:
> Hi,
>
> I've managed to successfully install PyCUDA and run the test driver.
> I also updated the wiki with instructions for OSX 10.7 and CUDA 4.0.
> If anyone would like to review these and let me know whether they work
> or whether I missed anything, I would be happy to make changes.
Thanks very much!
Andreas