Well, gpuarray offers far more than I actually need; for example, it knows
the size of the "array", which I will never use... It just feels like there
should be a lighter way to do it than to use the "big gun" of gpuarray.
I did the timing using kernprof.py (http://pythonhosted.org/line_profiler/);
the timings above come from its output. I ran some further tests via
IPython, timing only the multiplication:
The timeit function reports roughly the same time for the multiplication
alone as was reported for the c = a/b call. My current guess is: all four
Python statements merely enqueue their work in CUDA's scheduler, and only
when you want to access a result (as I do in the next line, where it is a
factor for the linear combination kernel) does Python have to wait for it.
And as the profiler profiles my Python code and not the actual GPU work, it
reports such seemingly strange values.
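One way to test this guess is to force the queue to drain before reading the clock. A minimal sketch, assuming a working PyCUDA installation with a GPU available (pycuda.autoinit sets up the context; the elementwise division stands in for any enqueued GPU work):

```python
import time

import numpy as np
import pycuda.autoinit  # creates a context on the first available GPU
import pycuda.gpuarray as gpuarray
import pycuda.driver as drv

a = gpuarray.to_gpu(np.random.rand(10**6).astype(np.float32))
b = gpuarray.to_gpu(np.random.rand(10**6).astype(np.float32))

# Kernel launches are asynchronous: this line only enqueues the work...
t0 = time.time()
c = a / b
t1 = time.time()

# ...so to measure the actual GPU time, synchronize before stopping the clock.
drv.Context.synchronize()
t2 = time.time()

print("enqueue only: %.6f s, enqueue + sync: %.6f s" % (t1 - t0, t2 - t0))
```

If the guess is right, the enqueue time should be tiny while enqueue + sync reflects the real kernel cost, and the "slow division" would simply be the statement at which the implicit wait happens.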
I also tried to use CUDA's own profiler, but I don't really understand what
it is telling me or how I can use it to speed up my program. So here are a
couple more questions I ran into:
How is the number of registers a thread uses determined?
How does the number of registers relate to the occupancy? (I fear I'm
missing some basics needed to understand and appreciate the CUDA
architecture.)
What is the influence of the grid dimensions and block dimensions (not the
total size, but the spread along the axes)?
I hope you don't mind me asking such unrelated questions.
2013/8/22 Andreas Kloeckner <lists(a)informa.tiker.net>
Andreas Baumbach <healther.astro(a)gmail.com>
a couple of weeks ago I asked a question regarding the gpuarray.muladd
function, as it only takes scalar values from RAM and is not able to take
any data directly from the GPU.
The solution Andreas offered back then was to simply write my own linear
combination kernel. That is what I just finished.
In writing this kernel I faced the question: how does one usually store a
single floating-point number on the GPU with PyCUDA?
I simply use a gpuarray object of length one, which works just fine, but
imho it's kind of overkill.
(Btw, the canonical way to do scalars on the GPU is to use shape
'()'. numpy allows the same thing for its 'array scalars'.)
The second question: profiling my code revealed that nearly all of the
time is spent in a single division on the GPU. The code is
multiply(matrix, vector1, result)
a = gpuarray.dot(vector2,vector2)
b = gpuarray.dot(vector1, result)
c = a/b
with vector1 and vector2 being large gpuarrays (10^6 entries). multiply
computes the result of my matrix-vector product (the matrix exists only in
algorithmic form) and stores it in result, which is also a gpuarray of the
same size as vector1 and vector2.
I would have expected multiply to take most of the time, but 99.8% is
spent in the c = a/b call.
Does anyone have an explanation to offer?
How are you doing the timing? Have you looked at profiler output?