Continuing my post about a simple speed test from before (thanks Ian!)
I have a modified version (at the end). This was the original thread:
http://tiker.net/pipermail/pycuda_tiker.net/2010-January/000940.html
The new code runs a loop on sin() using get_function, using a GPUArray
and using straight numpy. The get_function version is fastest, then
the GPUArray (limited, I guess, by the kernel launch and the fresh GPU
array allocated on each iteration), then numpy.
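As an aside, my numpy loop allocates a brand-new output array on every
iteration; numpy ufuncs accept an out argument that writes the result back
into the same buffer, which would shave a little off the CPU time. A quick
sketch (shorter iteration count, not timed on the same box):

```python
import numpy

# In-place variant of the CPU loop: numpy.sin writes back into `a`
# via the ufunc `out` argument instead of allocating a fresh array
# on every pass.
a = numpy.ones(8192, dtype=numpy.float32)
for _ in range(1000):
    numpy.sin(a, out=a)
print(a[:3])
```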
What confuses me is the following timing result:
Using nbr_values == 8192
Calculating 100000 iterations
GPU time: 0.167165344238
GPU result starts with... [ 0.005477 0.005477 0.005477]
GPUArray time: 4.83845751953
GPU result starts with... [ 0.005477 0.005477 0.005477]
CPU time: 32.987859375
CPU result starts with... [ 0.005477 0.005477 0.005477]
It looks as though the GPU solution (get_function) is roughly 197 times
faster than the CPU version (32.99s / 0.167s). Does this make sense?
I'd expected up to a 100* speed-up; seeing nearly 200* makes me suspicious.
Can anyone spot anything silly in the code? Maybe 200* is entirely
sensible for a simple for loop on a simple math operation?
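One way to sanity-check the ratio is to turn the timings above into raw
throughput (back-of-envelope only; the per-processor figure assumes the
9800GT's 112 stream processors, which I'm quoting from memory):

```python
# Back-of-envelope throughput from the timings printed above.
nbr_values = 8192
n_iter = 100000
sin_calls = nbr_values * n_iter       # 819.2 million sin() evaluations

cpu_secs = 32.99                      # numpy, one core
gpu_secs = 0.167                      # get_function kernel
print("CPU: %.1f M sin/s" % (sin_calls / cpu_secs / 1e6))   # ~25
print("GPU: %.1f G sin/s" % (sin_calls / gpu_secs / 1e9))   # ~4.9
print("speed-up: %.0f*" % (cpu_secs / gpu_secs))            # ~198
```

~25 million sin/s on one core looks believable for numpy (each pass also
allocates a fresh output array), and ~4.9 G sin/s spread over ~112 stream
processors is ~44 M sin/s per processor, so maybe a ~200* ratio isn't
silly after all.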
As before I'm on WinXP with an Intel Core2 Duo 2.66GHz CPU (one core
used) and a 9800GT GPU.
If this example makes sense then I'll add it to the wiki but I wanted
to do a sanity check first...
---------------------------------------
# based on hello_gpu.py
import pycuda.driver as drv
import pycuda.tools
import pycuda.autoinit
import numpy
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import pycuda.cumath
blocks = 64
block_size = 128
nbr_values = blocks * block_size
print "Using nbr_values ==", nbr_values
# number of iterations for the calculations,
# 100 is very quick, 2000000 will take a while
n_iter = 100000
print "Calculating %d iterations" % (n_iter)
# create two timers so we can speed-test each approach
start = drv.Event()
end = drv.Event()
#############
# GPU SECTION
mod = SourceModule("""
__global__ void gpusin(float *dest, float *a, int n_iter)
{
  const int i = blockDim.x*blockIdx.x + threadIdx.x;
  for(int n = 0; n < n_iter; n++) {
    a[i] = sin(a[i]);
  }
  dest[i] = a[i];
}
""")
gpusin = mod.get_function("gpusin")
# create an array of 1s
a = numpy.ones(nbr_values).astype(numpy.float32)
# create a destination array that will receive the result
dest = numpy.zeros_like(a)
start.record() # start timing
gpusin(drv.Out(dest), drv.In(a), numpy.int32(n_iter),
       grid=(blocks,1), block=(block_size,1,1))
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "GPU time:", secs
print "GPU result starts with...", dest[:3]
##################
# GPUArray SECTION
# here a new GPU array is allocated and a sin kernel launched for
# each iteration, which is a bottleneck
a = numpy.ones(nbr_values).astype(numpy.float32)
a_gpu = gpuarray.to_gpu(a)
start.record() # start timing
for i in range(n_iter):
    a_gpu = pycuda.cumath.sin(a_gpu)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "GPUArray time:", secs
print "GPU result starts with...", a_gpu.get()[:3]
#############
# CPU SECTION
# use numpy to calculate the result on the CPU
a = numpy.ones(nbr_values).astype(numpy.float32)
start.record() # start timing
for i in range(n_iter):
    a = numpy.sin(a)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print "CPU time:", secs
print "CPU result starts with...", a[:3]
---------------------------------------
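As an independent check that the answer itself is right (whichever device
computed it): iterating x -> sin(x) from 1.0 is known to decay like
sqrt(3/n), so after 100000 iterations every version should print roughly
sqrt(3/100000) ~= 0.005477, which is exactly what all three report above.
A pure-Python check:

```python
import math

# Iterate x -> sin(x) the same number of times as the benchmarks;
# the iterates decay towards 0 like sqrt(3/n).
n_iter = 100000
x = 1.0
for _ in range(n_iter):
    x = math.sin(x)

print(x)                        # ~0.005477
print(math.sqrt(3.0 / n_iter))  # ~0.005477
```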
--
Ian Ozsvald (A.I. researcher, screencaster)
ian(a)IanOzsvald.com
http://IanOzsvald.com
http://morconsulting.com/
http://TheScreencastingHandbook.com
http://ProCasts.co.uk/examples.html
http://twitter.com/ianozsvald