Hello.
I've been packaging PyCUDA for Debian.
I run all the tests to ensure that the package works on Python 2
and Python 3. All tests pass except for one from test_driver.py:
$ python test_driver.py
============================= test session starts ==============================
platform linux2 -- Python 2.7.5 -- pytest-2.3.5
collected 21 items

test_driver.py ........F............

=================================== FAILURES ===================================
_____________________ TestDriver.test_register_host_memory _____________________

args = (<test_driver.TestDriver instance at 0x24e7d88>,), kwargs = {}
pycuda = <module 'pycuda' from '/usr/lib/python2.7/dist-packages/pycuda/__init__.pyc'>
ctx = <pycuda._driver.Context object at 0x2504488>
clear_context_caches = <function clear_context_caches at 0x1dbf848>
collect = <built-in function collect>

    def f(*args, **kwargs):
        import pycuda.driver
        # appears to be idempotent, i.e. no harm in calling it more than once
        pycuda.driver.init()
        ctx = make_default_context()
        try:
            assert isinstance(ctx.get_device().name(), str)
            assert isinstance(ctx.get_device().compute_capability(), tuple)
            assert isinstance(ctx.get_device().get_attributes(), dict)
>           inner_f(*args, **kwargs)

/usr/lib/python2.7/dist-packages/pycuda/tools.py:434:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <test_driver.TestDriver instance at 0x24e7d88>

    @mark_cuda_test
    def test_register_host_memory(self):
        if drv.get_version() < (4,):
            from py.test import skip
            skip("register_host_memory only exists on CUDA 4.0 and later")
        import sys
        if sys.platform == "darwin":
            from py.test import skip
            skip("register_host_memory is not supported on OS X")
        a = drv.aligned_empty((2**20,), np.float64, alignment=4096)
>       drv.register_host_memory(a)
E       LogicError: cuMemHostRegister failed: invalid value

test_driver.py:559: LogicError
==================== 1 failed, 20 passed in 116.85 seconds =====================
This test fails on both ION (GeForce 9400M, CC 1.1) and a GeForce 460
(CC 2.1). I compiled PyCUDA with gcc 4.8 and ran it with kernel 3.9
and driver 304.88.
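For reference, the failing call can be reproduced outside the test
harness with a minimal sketch like the following (distilled from the
test above; it assumes pycuda.autoinit can create a context on your
device):

import numpy as np
import pycuda.autoinit  # creates a context on the default device
import pycuda.driver as drv

# Same allocation the test makes: 2**20 float64s, page-aligned.
a = drv.aligned_empty((2**20,), np.float64, alignment=4096)
# On the setups above this raises:
#   LogicError: cuMemHostRegister failed: invalid value
pinned = drv.register_host_memory(a)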
Regards.
--
Tomasz Rybak GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
On Thu, 29 Aug 2013 07:26:40 -0400
Ahmed Fasih <wuzzyview(a)gmail.com> wrote:
> Since PyCUDA is just a wrapper to the CUDA C library (with >99.9%
> coverage), anything CUDA supports, PyCUDA should support. More directly,
> PyCUDA and K20 work fine for me.
I am running PyCUDA on a Debian 6 computer with a GeForce Titan, driver 319.32 and CUDA 4.2.
My only problem is when I try to use scikit.cuda (for cuFFT), where sm_35 is not recognized.
The issue probably stems from the old CUDA version.
If you stick to pure PyCUDA, it runs fine.
--
Jérôme Kieffer
On-Line Data analysis / Software Group
ISDD / ESRF
tel +33 476 882 445
I do not have a K20, but I have successfully used PyCUDA with CUDA 5.5 and a GT 640 (GK208) card, which is a compute capability 3.5 device.
On Aug 29, 2013, at 6:05 AM, Michael Owen <mowen.1024(a)gmail.com> wrote:
> Hi,
> We are looking at buying some Kepler K20 Tesla cards, and were wondering what compute capability version PyCUDA supports. I believe the K20 is at 3.5; is PyCUDA there too?
>
> If not (the changelogs suggest CUDA 4.1 is the latest supported), will using CUDA 5.5 be a problem with the latest PyCUDA? Will it just be that certain bindings won't be present, e.g. for new features?
>
> Cheers
>
> Mike
Hi,
We are looking at buying some Kepler K20 Tesla cards, and were wondering
what compute capability version PyCUDA supports. I believe the K20 is at
3.5; is PyCUDA there too?
If not (the changelogs suggest CUDA 4.1 is the latest supported), will
using CUDA 5.5 be a problem with the latest PyCUDA? Will it just be that
certain bindings won't be present, e.g. for new features?
Cheers
Mike
Andreas Baumbach <healther.astro(a)gmail.com> writes:
> Well, gpuarray offers far more than I actually need; for instance, it knows
> the size of the "array", which I will never use... It just feels like there
> should be a lighter way to do it than to use the "big gun" of gpuarray.
There isn't, and even if there were a separate "GPU scalar" class, it
would look essentially the same as GPUArray.
> I did the timing using kernprof.py (http://pythonhosted.org/line_profiler/);
> the timings above come from its output. I ran some further tests via
> IPython, doing only the multiplication:
> The timeit function reports roughly the same time for the multiplication
> alone as was reported for the c = a/b call. My guess at the moment is: all
> four Python statements queue their work with the CUDA scheduler, and only
> when you want to access a result (as I do in the next line, where it is a
> factor for the linear-combination kernel) does Python have to wait for it.
> And as the profiler profiles my Python code and not the actual GPU work, it
> reports such seemingly strange values.
Python profilers are not appropriate for profiling GPU programs, because
the GPU runs asynchronously from the host program. Check the CUDA
programming guide for answers to your other questions.
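For the archives, a minimal sketch of timing GPU work with CUDA events
(array size mirrors the snippet quoted above; all names are
illustrative):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

vector2 = gpuarray.to_gpu(np.random.rand(2**20))

start, end = drv.Event(), drv.Event()
start.record()
a = gpuarray.dot(vector2, vector2)  # returns without waiting for the GPU
end.record()
end.synchronize()                   # block until the dot product finishes
print("dot: %.3f ms" % start.time_till(end))

Without the synchronize(), a host-side profiler attributes all the
queued GPU work to whichever later statement happens to block, which is
exactly the effect described above.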
> Also I tried to use CUDA's own profiler, but I don't really understand
> what it is telling me or how I can use it to speed up my program. So here
> is another couple of questions I ran into:
> How is the number of registers a thread uses determined?
nvcc does that.
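If you want to see the count nvcc chose, PyCUDA exposes it on compiled
kernels; a small sketch (the kernel is just a placeholder):

import pycuda.autoinit
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *x, float a) { x[threadIdx.x] *= a; }
""")
scale = mod.get_function("scale")
print(scale.num_regs)  # 32-bit registers used per thread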
> How does the number of registers relate to the occupancy? (I fear I'm
> missing some basics needed to understand and appreciate the CUDA
> occupancy calculator)
-> programming guide
> What is the influence of the grid dimensions and block dimensions? (not the
> total size, but the spread along the axes)
-> programming guide
Andreas
Andreas Baumbach <healther.astro(a)gmail.com> writes:
> a couple of weeks ago I asked a question regarding the gpuarray.muladd
> function, as it only takes scalar values from RAM and cannot take any
> data directly from the GPU.
> The solution Andreas offered back then was to simply write my own
> linear combination kernel. That is what I have just finished.
> In writing this kernel I faced the question: how does one usually
> store a single floating-point number on the GPU with PyCUDA?
> I simply use a gpuarray object of length one, which works just fine,
> but imho it's kind of overkill.
How so?
(Btw, the canonical way to do scalars on the GPU is to use shape
'()'. numpy allows the same thing for its 'array scalars'.)
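Concretely, a minimal sketch of a shape-() scalar on the GPU (a sketch;
the in-place add goes through GPUArray's ordinary scalar arithmetic):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray

s = gpuarray.zeros((), dtype=np.float64)  # a single float on the GPU
s += 3.0
print(s.get())  # 0-d numpy array: array(3.0)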
> The second question: profiling my code revealed that nearly all of
> the time is spent in a single division on the GPU. The code looks
> like:
>
> multiply(matrix, vector1, result)
> a = gpuarray.dot(vector2, vector2)
> b = gpuarray.dot(vector1, result)
> c = a/b
>
> with vector1 and vector2 being large gpuarrays (10^6 entries);
> multiply computes the result of my matrix-vector product (the matrix
> exists only in algorithmic form) and stores it in result, which is
> also a gpuarray of the same size as vector1 and vector2.
>
> I would have expected multiply to take most of the time, but 99.8%
> is spent in the c = a/b call.
>
> Does anyone have an explanation to offer?
How are you doing the timing? Have you looked at profiler output?
Andreas
Hey,
a couple of weeks ago I asked a question regarding the gpuarray.muladd
function, as it only takes scalar values from RAM and cannot take any
data directly from the GPU.
The solution Andreas offered back then was to simply write my own
linear combination kernel. That is what I have just finished.
In writing this kernel I faced the question: how does one usually
store a single floating-point number on the GPU with PyCUDA?
I simply use a gpuarray object of length one, which works just fine,
but imho it's kind of overkill.
The second question: profiling my code revealed that nearly all of
the time is spent in a single division on the GPU. The code looks
like:
multiply(matrix, vector1, result)
a = gpuarray.dot(vector2, vector2)
b = gpuarray.dot(vector1, result)
c = a/b
with vector1 and vector2 being large gpuarrays (10^6 entries);
multiply computes the result of my matrix-vector product (the matrix
exists only in algorithmic form) and stores it in result, which is
also a gpuarray of the same size as vector1 and vector2.
I would have expected multiply to take most of the time, but 99.8%
is spent in the c = a/b call.
Does anyone have an explanation to offer?
Cheers,
Andi
Hi David,
What libraries do you have in <cuda_installation_dir>/lib?
(<cuda_installation_dir> is /usr/local/cuda by default.) I have both
libcuda.dylib and libcudart.dylib there.
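A quick way to check from Python which of these the dynamic loader can
find (standard library only; a diagnostic sketch, not a fix):

import ctypes.util

print(ctypes.util.find_library("cuda"))    # driver library (libcuda)
print(ctypes.util.find_library("cudart"))  # runtime library (libcudart)

If either line prints None, that library is not on the loader's search
path, which matches the symptom described below.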
On Mon, Aug 19, 2013 at 1:39 PM, David P. Sanders
<dpsanders(a)ciencias.unam.mx> wrote:
> Hi,
>
> I am trying to install PyCUDA for the first time, but I am having trouble
> with the installation.
>
> I have installed CUDA 5.5 on my Mac with Mac OS 10.8 (Mountain Lion).
>
> However, neither the lib64 nor the libcuda libraries seem to be installed in
> this version; both are apparently required for PyCUDA.
>
> Could somebody please suggest a solution?
>
> Thanks and best wishes,
> David.
>
> --
> Dr. David P. Sanders
>
> Profesor Titular "A" / Associate Professor
> Departamento de Física, Facultad de Ciencias
> Universidad Nacional Autónoma de México (UNAM)
>
> dpsanders(a)ciencias.unam.mx
> http://sistemas.fciencias.unam.mx/~dsanders
>
> Cubículo / office: #414, 4o. piso del Depto. de Física
> Tel.: +52 55 5622 4965
>
Hi,
I am trying to install PyCUDA for the first time, but I am having trouble
with the installation.
I have installed CUDA 5.5 on my Mac with Mac OS 10.8 (Mountain Lion).
However, neither the lib64 nor the libcuda libraries seem to be installed
in this version; both are apparently required for PyCUDA.
Could somebody please suggest a solution?
Thanks and best wishes,
David.
--
Dr. David P. Sanders
Profesor Titular "A" / Associate Professor
Departamento de Física, Facultad de Ciencias
Universidad Nacional Autónoma de México (UNAM)
dpsanders(a)ciencias.unam.mx
http://sistemas.fciencias.unam.mx/~dsanders
Cubículo / office: #414, 4o. piso del Depto. de Física
Tel.: +52 55 5622 4965
Vivek Saxena <spinor87(a)gmail.com> writes:
> This problem was solved by the following commands:
>
> sudo ln -s /usr/lib/nvidia-325/libcuda.so /usr/lib/libcuda.so
> sudo ln -s /usr/lib/nvidia-325/libcuda.so.1 /usr/lib/libcuda.so.1
>
> But I now get a message saying
>
> TypeError: No registered converter was available to produce a C++ rvalue of
> type unsigned int from this Python object of type float
>
> when I run the matrix multiplication example code in PyCUDA's documentation.
>
> This is an error I see only on Arch Linux; my Ubuntu laptop with
> PyCUDA has never returned such an error. I have searched the mailing
> list, and others have had similar woes, but I couldn't find a working
> solution to this problem. My code is reproduced below:
I think I've fixed the example on the wiki. Thanks for pointing out the issue.
Andreas
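For the digest: the usual trigger for that TypeError is a Python float
reaching a kernel-launch parameter that must be an unsigned int, most
often the block or grid sizes. A hedged, self-contained sketch (the
kernel here is illustrative, not the wiki example):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void twice(float *x) { x[threadIdx.x] *= 2.0f; }
""")
twice = mod.get_function("twice")
x = gpuarray.to_gpu(np.ones(256, dtype=np.float32))

n = 512 / 2.0                  # a float slips in, e.g. from a division
# twice(x, block=(n, 1, 1))   # -> TypeError: No registered converter ...
twice(x, block=(int(n), 1, 1), grid=(1, 1))  # explicit cast fixes it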