Hi guys,
    I am playing with the example code hello_gpu.py, with some modifications to the block and grid sizes.

I found that if I set size to 512, which means all of the data fits in one thread block, the result is max(dest-a*b) = 0. However,
if I set size to 512*2, which means the data is split across two thread blocks in the grid, max(dest-a*b) != 0, which confuses me.
Did I miss something here? I have included the code below for reference. Thank you.

import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
import numpy.linalg as la
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

size = 512*2
threads = 512
a = numpy.random.randn(size).astype(numpy.float32)
b = numpy.random.randn(size).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(threads,1,1),grid=(size/threads,1))

print max(dest-a*b)
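
A possible explanation, sketched on the CPU rather than run on a GPU: threadIdx.x only counts threads within one block, so with two blocks both of them would write elements 0..511 and elements 512..1023 of dest would keep their initial zeros. The usual global index is blockIdx.x * blockDim.x + threadIdx.x. The helper names launch, local_index and global_index below are made up for this illustration and are not part of PyCUDA; the nested loops only emulate the order-independent effect of a 1-D launch.

```python
import random

random.seed(0)  # fixed seed so repeated runs use the same data

def launch(kernel_index, size, threads):
    """Emulate a 1-D kernel launch on the CPU: one loop iteration per GPU thread.

    Returns the largest absolute difference between dest and a*b.
    """
    a = [random.gauss(0.0, 1.0) for _ in range(size)]
    b = [random.gauss(0.0, 1.0) for _ in range(size)]
    dest = [0.0] * size
    blocks = size // threads  # assumes size is a multiple of threads
    for block_idx in range(blocks):
        for thread_idx in range(threads):
            i = kernel_index(block_idx, threads, thread_idx)
            dest[i] = a[i] * b[i]
    return max(abs(d - x * y) for d, x, y in zip(dest, a, b))

def local_index(block_idx, block_dim, thread_idx):
    # Same indexing as the posted kernel: i = threadIdx.x
    return thread_idx

def global_index(block_idx, block_dim, thread_idx):
    # Global indexing: i = blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

err_local = launch(local_index, 512 * 2, 512)    # second half of dest is never written
err_global = launch(global_index, 512 * 2, 512)  # every element is written exactly once
print(err_local, err_global)
```

If this is indeed the cause, changing the kernel line to `const int i = blockIdx.x * blockDim.x + threadIdx.x;` should make the two-block launch cover all 1024 elements.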

Best regards,
zhu