[PyCUDA] PyCuda Memory Question
Aaron Benjamin Greenblatt
aarong at stanford.edu
Tue Nov 3 11:28:54 PST 2009
Sorry for all the posts, but I can reproduce this behavior just by moving my Python print statements around in the version of the code that has no C source blocks.
Aaron
----- Original Message -----
From: "Aaron Benjamin Greenblatt" <aarong at stanford.edu>
To: pycuda at tiker.net
Sent: Tuesday, November 3, 2009 11:17:30 AM GMT -08:00 US/Canada Pacific
Subject: Re: [PyCUDA] PyCuda Memory Question
Well, that's not helpful. I didn't paste the output with the C source included. Here it is:
x:
[[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]]
y
[[ 0. 0. NaN 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
ydes
[[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]]
weightsL1
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]]
L1preadd
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]]
L1s
[ 0. 0. Inf Inf]
L1xout
[ 0. 0. 0. 0.]
weightsL2
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
L2preadd
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
L2s
[ 0. 0. 0. 0.]
L2xout
[ 0. 0. 0. 0.]
weightsL3
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
L3preadd
[[ 0. 0. 0. 0.]
[ 0. Inf 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
L3s
[ 0. 0. 0. 0.]
L3xout
[ 0. 0. 0. 0.]
----- Original Message -----
From: "Aaron Greenblatt" <aarong at stanford.edu>
To: pycuda at tiker.net
Sent: Tuesday, November 3, 2009 11:05:52 AM GMT -08:00 US/Canada Pacific
Subject: [PyCUDA] PyCuda Memory Question
Hi,
I'm new to Python but have coded stuff in C / CUDA before.
I am trying to copy some variables from Python / NumPy to a GPU, and then back
to the host again. When I get the data back from the GPU, I appear to get a few
random NaN and Inf values, and I'm confused as to why. I have
a few C source modules in the Python script, and when I remove them, some of
the Inf values go away. This confuses me even more, as I never even call the
functions in the C source modules, so removing them shouldn't make a difference.
(Or am I missing something there too?)
It almost seems like the system / video driver is overwriting the memory that I
write on the video card. Is this a possibility and, if so, how does one deal
with it in PyCUDA? (I haven't run into this issue when working in C / CUDA
before, but my dataset was also pretty small.) I'm going to look through
NVIDIA's CUDA programming guide again to make sure that I'm not missing
something obvious.
Also, I know that I need to optimize the code in the C modules - for now I just
want to get something working, and then I'll write C code that uses the hardware
better.
I've attached source code and output with and without the C source modules.
Does anyone have thoughts as to what's going on here? Thanks for your help!
Aaron
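One host-side possibility worth ruling out before suspecting the driver, offered as a hypothesis and not something confirmed in this thread: `numpy.empty` and `numpy.empty_like` return uninitialized memory, so buffers such as `y`, `L1s`, and the `PreAdd` arrays may hold arbitrary bytes before they ever reach the GPU. The sketch below uses only the standard library to show what a few leftover 64-bit patterns look like once cast down to float32. The helper names and the sample bit patterns are made up for illustration.

```python
import math
import struct

def float64_from_bits(bits):
    """Reinterpret a 64-bit integer pattern as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def cast_to_float32(value):
    """Round a Python float to float32, overflowing to +/-Inf like a C cast."""
    try:
        return struct.unpack("<f", struct.pack("<f", value))[0]
    except OverflowError:
        # struct refuses to pack doubles beyond the float32 range
        return math.copysign(math.inf, value)

# Hypothetical leftover bit patterns an uninitialized float64 buffer might hold.
leftover_bits = [
    0x0000000000000000,  # all zero bytes
    0x0000000000001234,  # tiny denormal double
    0x7FE1234567890ABC,  # large finite double, beyond float32 range
    0x7FF8000000000001,  # a NaN pattern
]

for bits in leftover_bits:
    as_double = float64_from_bits(bits)
    print(hex(bits), "->", cast_to_float32(as_double))
```

Zero and tiny patterns come out as 0, very large ones as Inf, and NaN patterns stay NaN, which resembles the mix seen in the output. If this is the cause, allocating the buffers with `numpy.zeros` instead of `numpy.empty` should make the round trip deterministic.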
**** Script without C source ***
# Sample source code from the Tutorial Introduction in the documentation.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
x = numpy.ones([4,5]) * .01
ydes = x
y = numpy.empty_like(x)
L1neurons = 4
L2neurons = 4
L3neurons = 4
L1weightsPerNeuron = x.size
L2weightsPerNeuron = L1neurons
L3weightsPerNeuron = L2neurons
weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])
L1s = numpy.empty([L1neurons])
L2s = numpy.empty([L2neurons])
L3s = numpy.empty([L3neurons])
L1xout = numpy.empty_like(L1s)
L1PreAdd = numpy.empty_like(weightsL1)
L2xout = numpy.empty_like(L2s)
L2PreAdd = numpy.empty_like(weightsL2)
L3xout = numpy.empty_like(L3s)
L3PreAdd = numpy.empty_like(weightsL3)
# convert these variables to float singles for GPU use
x = x.astype(numpy.float32)
ydes = ydes.astype(numpy.float32)
y = y.astype(numpy.float32)
weightsL1 = weightsL1.astype(numpy.float32)
weightsL2 = weightsL2.astype(numpy.float32)
weightsL3 = weightsL3.astype(numpy.float32)
L1s = L1s.astype(numpy.float32)
L2s = L2s.astype(numpy.float32)
L3s = L3s.astype(numpy.float32)
L1PreAdd = L1PreAdd.astype(numpy.float32)
L1xout = L1xout.astype(numpy.float32)
L2PreAdd = L2PreAdd.astype(numpy.float32)
L2xout = L2xout.astype(numpy.float32)
L3PreAdd = L3PreAdd.astype(numpy.float32)
L3xout = L3xout.astype(numpy.float32)
# allocate GPU memory
GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)
# copy variables to GPU
cuda.memcpy_htod(GPUx, x)
cuda.memcpy_htod(GPUydes, ydes)
cuda.memcpy_htod(GPUy, y)
cuda.memcpy_htod(GPUweightsL1, weightsL1)
cuda.memcpy_htod(GPUweightsL2, weightsL2)
cuda.memcpy_htod(GPUweightsL3, weightsL3)
cuda.memcpy_htod(GPUL1s, L1s)
cuda.memcpy_htod(GPUL2s, L2s)
cuda.memcpy_htod(GPUL3s, L3s)
cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
cuda.memcpy_htod(GPUL1xout, L1xout)
cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
cuda.memcpy_htod(GPUL2xout, L2xout)
cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
cuda.memcpy_htod(GPUL3xout, L3xout)
# Print stuff
cuda.memcpy_dtoh(x, GPUx)
cuda.memcpy_dtoh(ydes, GPUydes)
cuda.memcpy_dtoh(y, GPUy)
cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
cuda.memcpy_dtoh(weightsL3, GPUweightsL3)
cuda.memcpy_dtoh(L1s, GPUL1s)
cuda.memcpy_dtoh(L2s, GPUL2s)
cuda.memcpy_dtoh(L3s, GPUL3s)
cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
cuda.memcpy_dtoh(L1xout, GPUL1xout)
cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
cuda.memcpy_dtoh(L2xout, GPUL2xout)
cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
cuda.memcpy_dtoh(L3xout, GPUL3xout)
print "x:"
print x
print "y"
print y
print "ydes"
print ydes
print "weightsL1"
print weightsL1
print "L1preadd"
print L1PreAdd
print "L1s"
print L1s
print "L1xout"
print L1xout
print "weightsL2"
print weightsL2
print "L2preadd"
print L2PreAdd
print "L2s"
print L2s
print "L2xout"
print L2xout
print "weightsL3"
print weightsL3
print "L3preadd"
print L3PreAdd
print "L3s"
print L3s
print "L3xout"
print L3xout
****** Output without C source *****
x:
[[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]]
y
[[ 0. 0. NaN 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
ydes
[[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]
[ 0.01 0.01 0.01 0.01 0.01]]
weightsL1
[[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1.]]
L1preadd
[[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0.]]
L1s
[ 0. 0. 0. 0.]
L1xout
[ 0. 0. 0. 0.]
weightsL2
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
L2preadd
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
L2s
[ 0. 0. 0. 0.]
L2xout
[ 0. 0. 0. 0.]
weightsL3
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
L3preadd
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
L3s
[ 0. 0. 0. 0.]
L3xout
[ 0. 0. 0. 0.]
******* Script with C Source ***************
# Sample source code from the Tutorial Introduction in the documentation.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
x = numpy.ones([4,5]) * .01
ydes = x
y = numpy.empty_like(x)
L1neurons = 4
L2neurons = 4
L3neurons = 4
L1weightsPerNeuron = x.size
L2weightsPerNeuron = L1neurons
L3weightsPerNeuron = L2neurons
weightsL1 = numpy.ones([L1neurons,L1weightsPerNeuron])
weightsL2 = numpy.ones([L2neurons,L2weightsPerNeuron])
weightsL3 = numpy.ones([L3neurons,L3weightsPerNeuron])
L1s = numpy.empty([L1neurons])
L2s = numpy.empty([L2neurons])
L3s = numpy.empty([L3neurons])
L1xout = numpy.empty_like(L1s)
L1PreAdd = numpy.empty_like(weightsL1)
L2xout = numpy.empty_like(L2s)
L2PreAdd = numpy.empty_like(weightsL2)
L3xout = numpy.empty_like(L3s)
L3PreAdd = numpy.empty_like(weightsL3)
# convert these variables to float singles for GPU use
x = x.astype(numpy.float32)
ydes = ydes.astype(numpy.float32)
y = y.astype(numpy.float32)
weightsL1 = weightsL1.astype(numpy.float32)
weightsL2 = weightsL2.astype(numpy.float32)
weightsL3 = weightsL3.astype(numpy.float32)
L1s = L1s.astype(numpy.float32)
L2s = L2s.astype(numpy.float32)
L3s = L3s.astype(numpy.float32)
L1PreAdd = L1PreAdd.astype(numpy.float32)
L1xout = L1xout.astype(numpy.float32)
L2PreAdd = L2PreAdd.astype(numpy.float32)
L2xout = L2xout.astype(numpy.float32)
L3PreAdd = L3PreAdd.astype(numpy.float32)
L3xout = L3xout.astype(numpy.float32)
# allocate GPU memory
GPUx = cuda.mem_alloc(x.size * x.dtype.itemsize)
GPUydes = cuda.mem_alloc(ydes.size * ydes.dtype.itemsize)
GPUy = cuda.mem_alloc(y.size * y.dtype.itemsize)
GPUweightsL1 = cuda.mem_alloc(weightsL1.size * weightsL1.dtype.itemsize)
GPUweightsL2 = cuda.mem_alloc(weightsL2.size * weightsL2.dtype.itemsize)
GPUweightsL3 = cuda.mem_alloc(weightsL3.size * weightsL3.dtype.itemsize)
GPUL1s = cuda.mem_alloc(L1s.size * L1s.dtype.itemsize)
GPUL2s = cuda.mem_alloc(L2s.size * L2s.dtype.itemsize)
GPUL3s = cuda.mem_alloc(L3s.size * L3s.dtype.itemsize)
GPUL1PreAdd = cuda.mem_alloc(L1PreAdd.size * L1PreAdd.dtype.itemsize)
GPUL1xout = cuda.mem_alloc(L1xout.size * L1xout.dtype.itemsize)
GPUL2PreAdd = cuda.mem_alloc(L2PreAdd.size * L2PreAdd.dtype.itemsize)
GPUL2xout = cuda.mem_alloc(L2xout.size * L2xout.dtype.itemsize)
GPUL3PreAdd = cuda.mem_alloc(L3PreAdd.size * L3PreAdd.dtype.itemsize)
GPUL3xout = cuda.mem_alloc(L3xout.size * L3xout.dtype.itemsize)
# copy variables to GPU
cuda.memcpy_htod(GPUx, x)
cuda.memcpy_htod(GPUydes, ydes)
cuda.memcpy_htod(GPUy, y)
cuda.memcpy_htod(GPUweightsL1, weightsL1)
cuda.memcpy_htod(GPUweightsL2, weightsL2)
cuda.memcpy_htod(GPUweightsL3, weightsL3)
cuda.memcpy_htod(GPUL1s, L1s)
cuda.memcpy_htod(GPUL2s, L2s)
cuda.memcpy_htod(GPUL3s, L3s)
cuda.memcpy_htod(GPUL1PreAdd, L1PreAdd)
cuda.memcpy_htod(GPUL1xout, L1xout)
cuda.memcpy_htod(GPUL2PreAdd, L2PreAdd)
cuda.memcpy_htod(GPUL2xout, L2xout)
cuda.memcpy_htod(GPUL3PreAdd, L3PreAdd)
cuda.memcpy_htod(GPUL3xout, L3xout)
# C source code for stuff we do on GPU
ForwardMult = SourceModule("""
__global__ void layer1forward(float *x, float *weights, float *preAdd)
{
// this does the multiplication in the forward neural net and outputs a
// pre-addition matrix
//initialize variables
int elementIdx = threadIdx.x + blockIdx.x*blockDim.x;
int neuronIdx = blockIdx.y;
int numweights = blockDim.x * gridDim.x;
// do multiply
preAdd[neuronIdx*numweights+elementIdx] = weights[neuronIdx*numweights +
elementIdx] * x[elementIdx];
}
""")
ForwardAdd = SourceModule("""
__global__ void layer1forward(float *preAdd, float *s)
{
// this adds together the products from forwardmult.
// do add
int numweights = 20;
for(int i = 0; i< numweights; i++) {
s[threadIdx.x] = s[threadIdx.x] + preAdd[numweights * threadIdx.x + i];
}
}
""")
ForwardSigmoid = SourceModule("""
__global__ void sigmoid(float *s, float *xout)
{
// this applies the sigmoid function
xout[threadIdx.x] = (1 - exp(-2*s[threadIdx.x])) / (1 + exp(-2*s[threadIdx.x]));
}
""")
# Print stuff
cuda.memcpy_dtoh(x, GPUx)
cuda.memcpy_dtoh(ydes, GPUydes)
cuda.memcpy_dtoh(y, GPUy)
cuda.memcpy_dtoh(weightsL1, GPUweightsL1)
cuda.memcpy_dtoh(weightsL2, GPUweightsL2)
cuda.memcpy_dtoh(weightsL3, GPUweightsL3)
cuda.memcpy_dtoh(L1s, GPUL1s)
cuda.memcpy_dtoh(L2s, GPUL2s)
cuda.memcpy_dtoh(L3s, GPUL3s)
cuda.memcpy_dtoh(L1PreAdd, GPUL1PreAdd)
cuda.memcpy_dtoh(L1xout, GPUL1xout)
cuda.memcpy_dtoh(L2PreAdd, GPUL2PreAdd)
cuda.memcpy_dtoh(L2xout, GPUL2xout)
cuda.memcpy_dtoh(L3PreAdd, GPUL3PreAdd)
cuda.memcpy_dtoh(L3xout, GPUL3xout)
print "x:"
print x
print "y"
print y
print "ydes"
print ydes
print "weightsL1"
print weightsL1
print "L1preadd"
print L1PreAdd
print "L1s"
print L1s
print "L1xout"
print L1xout
print "weightsL2"
print weightsL2
print "L2preadd"
print L2PreAdd
print "L2s"
print L2s
print "L2xout"
print L2xout
print "weightsL3"
print weightsL3
print "L3preadd"
print L3PreAdd
print "L3s"
print L3s
print "L3xout"
print L3xout
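For checking the kernels once they are actually launched, a CPU reference may help. The following is a pure-Python sketch of the three steps in the C source above (elementwise multiply, per-neuron sum, and the sigmoid expression), using the sizes and initial values from the script. The function names are invented here and are not part of PyCUDA. It assumes `s` starts at zero, which the script itself does not guarantee, since `L1s` comes from `numpy.empty`.

```python
import math

def forward_mult(x, weights):
    """preAdd[n][i] = weights[n][i] * x[i], as in the multiply kernel."""
    return [[w * xi for w, xi in zip(row, x)] for row in weights]

def forward_add(pre_add):
    """s[n] = sum of row n of preAdd (assumes s starts at zero)."""
    return [sum(row) for row in pre_add]

def sigmoid(s):
    """Same expression as the sigmoid kernel; algebraically equal to tanh(s)."""
    return [(1 - math.exp(-2 * v)) / (1 + math.exp(-2 * v)) for v in s]

# Values from the script: 20 inputs of 0.01, 4 neurons, all weights 1.
x = [0.01] * 20
weights = [[1.0] * 20 for _ in range(4)]

pre_add = forward_mult(x, weights)
s = forward_add(pre_add)
xout = sigmoid(s)

print(s)     # each entry is approximately 0.2
print(xout)  # each entry is approximately tanh(0.2)
```

With these inputs every product is 0.01, so each neuron's sum should come out near 0.2, and the GPU results for layer 1 can be compared against that.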
**************** Output with C source **************
_______________________________________________
PyCUDA mailing list
PyCUDA at tiker.net
http://tiker.net/mailman/listinfo/pycuda_tiker.net