[PyCUDA] PyCuda Memory Question

Andreas Klöckner lists at informa.tiker.net
Tue Nov 3 17:04:24 PST 2009


Hi Aaron,

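Well, you're copying uninitialized memory to the GPU. In numpy, "empty" means
"uninitialized": numpy.empty() and numpy.empty_like() allocate an array without
setting its contents, so whatever bytes were already there come back from the
GPU.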
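
A minimal sketch of the difference, using only numpy. It assumes the buffers
are meant to start at zero, which the original script does not state; the
shapes are taken from the script (4 neurons, 20 weights per neuron).

```python
import numpy as np

# numpy.empty only reserves memory. The contents are whatever bytes happened
# to be there, so individual entries may show up as NaN or Inf.
scratch = np.empty(4)

# numpy.zeros gives defined contents, and passing dtype= up front avoids the
# separate astype(numpy.float32) step. (Assumption: zero is the intended
# starting value.)
L1s = np.zeros(4, dtype=np.float32)
L1PreAdd = np.zeros((4, 20), dtype=np.float32)

# Every entry of the zero-initialised arrays is finite.
print(np.isfinite(L1s).all(), np.isfinite(L1PreAdd).all())
```

Note that astype() only converts the values it is given, so calling it on an
array from numpy.empty does not make the contents defined.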

Also, you may want to look at PyCUDA's GPUArrays (pycuda.gpuarray), which take
care of the device allocation and the copies for you.
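
A rough sketch of what that could look like, assuming a CUDA-capable device
and a working PyCUDA install. The helper names make_host_buffers and
roundtrip_via_gpuarray are made up for illustration; gpuarray.to_gpu() and
.get() are the PyCUDA calls that do the transfers.

```python
import numpy as np


def make_host_buffers(n_inputs=20, n_neurons=4):
    """Build initialised float32 host arrays (hypothetical helper)."""
    return {
        "weights": np.ones((n_neurons, n_inputs), dtype=np.float32),
        "s": np.zeros(n_neurons, dtype=np.float32),
    }


def roundtrip_via_gpuarray(host_array):
    """Copy an array to the GPU and back again (hypothetical helper)."""
    # Imported here so that the rest of the module still loads on a machine
    # without a GPU. Importing pycuda.autoinit creates a context on a device.
    import pycuda.autoinit  # noqa: F401
    import pycuda.gpuarray as gpuarray

    device_array = gpuarray.to_gpu(host_array)  # allocates and copies to GPU
    return device_array.get()                   # copies back into a new array


buffers = make_host_buffers()
```

With a GPU present, roundtrip_via_gpuarray(buffers["weights"]) should return
an array equal to the input, with no explicit mem_alloc or memcpy calls.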

Andreas


On Tuesday 03 November 2009, Aaron Benjamin Greenblatt wrote:
> Sorry for all the posts, but I am able to reproduce this behavior by moving
>  my python print statements around on the code that doesn't have any C
>  source blocks.
> 
> Aaron
> 
> 
> ----- Original Message -----
> From: "Aaron Benjamin Greenblatt" <aarong at stanford.edu>
> To: pycuda at tiker.net
> Sent: Tuesday, November 3, 2009 11:17:30 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [PyCUDA] PyCuda Memory Question
> 
> Well, that's not helpful. I didn't paste the output with the C source
>  included. Here it is:
> 
> x:
> [[ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]]
> y
> [[  0.   0.  NaN   0.   0.]
>  [  0.   0.   0.   0.   0.]
>  [  0.   0.   0.   0.   0.]
>  [  0.   0.   0.   0.   0.]]
> ydes
> [[ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]]
> weightsL1
> [[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]
>  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]
>  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]
>  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]]
> L1preadd
> [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]
>  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]
>  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]
>  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]]
> L1s
> [  0.   0.  Inf  Inf]
> L1xout
> [ 0.  0.  0.  0.]
> weightsL2
> [[ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]]
> L2preadd
> [[ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]]
> L2s
> [ 0.  0.  0.  0.]
> L2xout
> [ 0.  0.  0.  0.]
> weightsL3
> [[ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]]
> L3preadd
> [[  0.   0.   0.   0.]
>  [  0.  Inf   0.   0.]
>  [  0.   0.   0.   0.]
>  [  0.   0.   0.   0.]]
> L3s
> [ 0.  0.  0.  0.]
> L3xout
> [ 0.  0.  0.  0.]
> 
> 
> ----- Original Message -----
> From: "Aaron Greenblatt" <aarong at stanford.edu>
> To: pycuda at tiker.net
> Sent: Tuesday, November 3, 2009 11:05:52 AM GMT -08:00 US/Canada Pacific
> Subject: [PyCUDA] PyCuda Memory Question
> 
> Hi,
> 
> I'm new to Python but have coded stuff in C / CUDA before.
> 
> I am trying to copy some variables from Python / Numpy to a GPU, and then
>  back to the host again. When I get the stuff back from the GPU, I appear
>  to get a few random NaN's and Inf values - I'm confused as to why these
>  are happening. I have a few C source modules in the Python script, and,
>  when I remove them, some of the Inf's go away. This confuses me even more,
>  as I never even called the functions in the C source modules, so removing
>  them shouldn't make a difference. (Or am I missing something there too?)
> 
> It almost seems like the system / video driver is overwriting the memory
>  that I write on the video card. Is this a possibility and, if so, how does
>  one deal with it in PyCuda? (I haven't run into this issue when working on
>  C / CUDA before, but my dataset was also pretty small). I'm going to look
>  through nVidia's CUDA programming guide again to make sure that I'm not
>  missing something obvious.
> 
> Also, I know that I need to optimize the code in the C modules - for now I
>  just want to get something working, and then I'll write C code that uses
>  the hardware better.
> 
> I've attached source code and output with and without the C source modules.
> 
> Does anyone have thoughts as to what's going on here? Thanks for your help!
> 
> Aaron
> 
> 
> **** Script  without C source ***
> 
> # Sample source code from the Tutorial Introduction in the documentation.
> 
> import pycuda.driver as cuda
> import pycuda.autoinit
> from pycuda.compiler import SourceModule
> import numpy
> 
> x = numpy.ones([4,5]) * .01
> ydes = x
> y = numpy.empty_like(x)
> L1neurons = 4
> L2neurons = 4
> L3neurons = 4
> L1weightsPerNeuron = x.size
> L2weightsPerNeuron = L1neurons
> L3weightsPerNeuron = L2neurons
> weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
> weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
> weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])
> L1s = numpy.empty([L1neurons])
> L2s = numpy.empty([L2neurons])
> L3s = numpy.empty([L3neurons])
> L1xout = numpy.empty_like(L1s)
> L1PreAdd = numpy.empty_like(weightsL1)
> L2xout = numpy.empty_like(L2s)
> L2PreAdd = numpy.empty_like(weightsL2)
> L3xout = numpy.empty_like(L3s)
> L3PreAdd = numpy.empty_like(weightsL3)
> 
> # convert these variables to float singles for GPU use
> x = x.astype(numpy.float32)
> ydes = ydes.astype(numpy.float32)
> y = y.astype(numpy.float32)
> weightsL1 = weightsL1.astype(numpy.float32)
> weightsL2 = weightsL2.astype(numpy.float32)
> weightsL3 = weightsL3.astype(numpy.float32)
> L1s = L1s.astype(numpy.float32)
> L2s = L2s.astype(numpy.float32)
> L3s = L3s.astype(numpy.float32)
> L1PreAdd = L1PreAdd.astype(numpy.float32)
> L1xout = L1xout.astype(numpy.float32)
> L2PreAdd = L2PreAdd.astype(numpy.float32)
> L2xout = L2xout.astype(numpy.float32)
> L3PreAdd = L3PreAdd.astype(numpy.float32)
> L3xout = L3xout.astype(numpy.float32)
> 
> # allocate GPU memory
> GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
> GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
> GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
> GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
> GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
> GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
> GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
> GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
> GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
> GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
> GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
> GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
> GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
> GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
> GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)
> 
> # copy variables to GPU
> cuda.memcpy_htod(GPUx, x)
> cuda.memcpy_htod(GPUydes, ydes)
> cuda.memcpy_htod(GPUy, y)
> cuda.memcpy_htod(GPUweightsL1, weightsL1)
> cuda.memcpy_htod(GPUweightsL2, weightsL2)
> cuda.memcpy_htod(GPUweightsL3, weightsL3)
> cuda.memcpy_htod(GPUL1s, L1s)
> cuda.memcpy_htod(GPUL2s, L2s)
> cuda.memcpy_htod(GPUL3s, L3s)
> cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
> cuda.memcpy_htod(GPUL1xout, L1xout)
> cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
> cuda.memcpy_htod(GPUL2xout, L2xout)
> cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
> cuda.memcpy_htod(GPUL3xout, L3xout)
> 
> # Print stuff
> cuda.memcpy_dtoh(x, GPUx)
> cuda.memcpy_dtoh(ydes, GPUydes)
> cuda.memcpy_dtoh(y, GPUy)
> cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
> cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
> cuda.memcpy_dtoh(weightsL3, GPUweightsL3)
> 
> cuda.memcpy_dtoh(L1s, GPUL1s)
> cuda.memcpy_dtoh(L2s, GPUL2s)
> cuda.memcpy_dtoh(L3s, GPUL3s)
> cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
> cuda.memcpy_dtoh(L1xout, GPUL1xout)
> cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
> cuda.memcpy_dtoh(L2xout, GPUL2xout)
> cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
> cuda.memcpy_dtoh(L3xout, GPUL3xout)
> print "x:"
> print x
> print "y"
> print y
> print "ydes"
> print ydes
> print "weightsL1"
> print weightsL1
> print "L1preadd"
> print L1PreAdd
> print "L1s"
> print L1s
> print "L1xout"
> print L1xout
> print "weightsL2"
> print weightsL2
> print "L2preadd"
> print L2PreAdd
> print "L2s"
> print L2s
> print "L2xout"
> print L2xout
> print "weightsL3"
> print weightsL3
> print "L3preadd"
> print L3PreAdd
> print "L3s"
> print L3s
> print "L3xout"
> print L3xout
> 
> ****** Output without C source *****
> 
> x:
> [[ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]]
> y
> [[  0.   0.  NaN   0.   0.]
>  [  0.   0.   0.   0.   0.]
>  [  0.   0.   0.   0.   0.]
>  [  0.   0.   0.   0.   0.]]
> ydes
> [[ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]
>  [ 0.01  0.01  0.01  0.01  0.01]]
> weightsL1
> [[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]
>  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]
>  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]
>  [ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
>    1.  1.]]
> L1preadd
> [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]
>  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]
>  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]
>  [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
>    0.  0.]]
> L1s
> [ 0.  0.  0.  0.]
> L1xout
> [ 0.  0.  0.  0.]
> weightsL2
> [[ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]]
> L2preadd
> [[ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]]
> L2s
> [ 0.  0.  0.  0.]
> L2xout
> [ 0.  0.  0.  0.]
> weightsL3
> [[ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]
>  [ 1.  1.  1.  1.]]
> L3preadd
> [[ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]
>  [ 0.  0.  0.  0.]]
> L3s
> [ 0.  0.  0.  0.]
> L3xout
> [ 0.  0.  0.  0.]
> 
> 
> ******* Script with C Source ***************
> 
> # Sample source code from the Tutorial Introduction in the documentation.
> 
> import pycuda.driver as cuda
> import pycuda.autoinit
> from pycuda.compiler import SourceModule
> import numpy
> 
> x = numpy.ones([4,5]) * .01
> ydes = x
> y = numpy.empty_like(x)
> L1neurons = 4
> L2neurons = 4
> L3neurons = 4
> L1weightsPerNeuron = x.size
> L2weightsPerNeuron = L1neurons
> L3weightsPerNeuron = L2neurons
> weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
> weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
> weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])
> L1s = numpy.empty([L1neurons])
> L2s = numpy.empty([L2neurons])
> L3s = numpy.empty([L3neurons])
> L1xout = numpy.empty_like(L1s)
> L1PreAdd = numpy.empty_like(weightsL1)
> L2xout = numpy.empty_like(L2s)
> L2PreAdd = numpy.empty_like(weightsL2)
> L3xout = numpy.empty_like(L3s)
> L3PreAdd = numpy.empty_like(weightsL3)
> 
> # convert these variables to float singles for GPU use
> x = x.astype(numpy.float32)
> ydes = ydes.astype(numpy.float32)
> y = y.astype(numpy.float32)
> weightsL1 = weightsL1.astype(numpy.float32)
> weightsL2 = weightsL2.astype(numpy.float32)
> weightsL3 = weightsL3.astype(numpy.float32)
> L1s = L1s.astype(numpy.float32)
> L2s = L2s.astype(numpy.float32)
> L3s = L3s.astype(numpy.float32)
> L1PreAdd = L1PreAdd.astype(numpy.float32)
> L1xout = L1xout.astype(numpy.float32)
> L2PreAdd = L2PreAdd.astype(numpy.float32)
> L2xout = L2xout.astype(numpy.float32)
> L3PreAdd = L3PreAdd.astype(numpy.float32)
> L3xout = L3xout.astype(numpy.float32)
> 
> # allocate GPU memory
> GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
> GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
> GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
> GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
> GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
> GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
> GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
> GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
> GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
> GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
> GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
> GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
> GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
> GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
> GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)
> 
> # copy variables to GPU
> cuda.memcpy_htod(GPUx, x)
> cuda.memcpy_htod(GPUydes, ydes)
> cuda.memcpy_htod(GPUy, y)
> cuda.memcpy_htod(GPUweightsL1, weightsL1)
> cuda.memcpy_htod(GPUweightsL2, weightsL2)
> cuda.memcpy_htod(GPUweightsL3, weightsL3)
> cuda.memcpy_htod(GPUL1s, L1s)
> cuda.memcpy_htod(GPUL2s, L2s)
> cuda.memcpy_htod(GPUL3s, L3s)
> cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
> cuda.memcpy_htod(GPUL1xout, L1xout)
> cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
> cuda.memcpy_htod(GPUL2xout, L2xout)
> cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
> cuda.memcpy_htod(GPUL3xout, L3xout)
> 
> # C source code for stuff we do on GPU
> ForwardMult = SourceModule("""
>     __global__ void layer1forward(float *x, float *weights, float *preAdd)
>     {
>         // this does the multiplication in the forward neural net and
>         // outputs a pre-addition matrix
>         // initialize variables
>         int elementIdx = threadIdx.x + blockIdx.x*4;
>         int neuronIdx = blockIdx.y;
>         int numweights = blockDim.x * gridDim.x;
>         // do multiply
>         preAdd[neuronIdx*numweights + elementIdx] =
>             weights[neuronIdx*numweights + elementIdx] * x[elementIdx];
>     }
>     """)
> ForwardAdd = SourceModule("""
>     __global__ void layer1forward(float *preAdd, float *s)
>     {
>         // this adds together the products from ForwardMult
>         // do add
>         int numweights = 20;
>         for (int i = 0; i < numweights; i++) {
>             s[threadIdx.x] = s[threadIdx.x] + preAdd[numweights * threadIdx.x + i];
>         }
>     }
>     """)
> ForwardSigmoid = SourceModule("""
>     __global__ void sigmoid(float *s, float *xout)
>     {
>         // this applies the sigmoid function
>         xout[threadIdx.x] = (1 - exp(-2*s[threadIdx.x]))
>                           / (1 + exp(-2*s[threadIdx.x]));
>     }
>     """)
> 
> # Print stuff
> cuda.memcpy_dtoh(x, GPUx)
> cuda.memcpy_dtoh(ydes, GPUydes)
> cuda.memcpy_dtoh(y, GPUy)
> cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
> cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
> cuda.memcpy_dtoh(weightsL3, GPUweightsL3)
> 
> cuda.memcpy_dtoh(L1s, GPUL1s)
> cuda.memcpy_dtoh(L2s, GPUL2s)
> cuda.memcpy_dtoh(L3s, GPUL3s)
> cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
> cuda.memcpy_dtoh(L1xout, GPUL1xout)
> cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
> cuda.memcpy_dtoh(L2xout, GPUL2xout)
> cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
> cuda.memcpy_dtoh(L3xout, GPUL3xout)
> print "x:"
> print x
> print "y"
> print y
> print "ydes"
> print ydes
> print "weightsL1"
> print weightsL1
> print "L1preadd"
> print L1PreAdd
> print "L1s"
> print L1s
> print "L1xout"
> print L1xout
> print "weightsL2"
> print weightsL2
> print "L2preadd"
> print L2PreAdd
> print "L2s"
> print L2s
> print "L2xout"
> print L2xout
> print "weightsL3"
> print weightsL3
> print "L3preadd"
> print L3PreAdd
> print "L3s"
> print L3s
> print "L3xout"
> print L3xout
> 
> **************** Output with C source **************
> 
> 
> 
> 
> 
> _______________________________________________
> PyCUDA mailing list
> PyCUDA at tiker.net
> http://tiker.net/mailman/listinfo/pycuda_tiker.net
> 
