A couple of weeks ago I asked a question about the gpuarray.muladd
function, as it only takes scalar values from host RAM and cannot
take any data directly from the GPU.
The solution Andreas offered back then was to simply write my own
linear-combination kernel, which is what I have just finished.
While writing this kernel I ran into the question: how does one
usually store a single floating-point number on the GPU with PyCUDA?
I simply use a gpuarray object of length one, which works just fine,
but IMHO it is kind of overkill.
The second question: profiling my code revealed that nearly all of
the time is spent in a single division on the GPU. The code is
    multiply(matrix, vector1, result)
    a = gpuarray.dot(vector2, vector2)
    b = gpuarray.dot(vector1, result)
    c = a / b
where vector1 and vector2 are large gpuarrays (10^6 entries), and
multiply computes my matrix-vector product (the matrix exists only in
algorithmic form) and stores it in result, which is also a gpuarray
of the same size as vector1 and vector2.
I would have expected multiply to take most of the time, but 99.8%
is spent in the c = a / b call.
Does anyone have an explanation to offer?