Hi blahblahblah,
I had the same error and tried your way.
I did this:
user@ubuntu:~/pycuda-2011.2.2$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
>>> import pycuda.driver as cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pycuda/driver.py", line 2, in <module>
from pycuda._driver import *
ImportError: No module named _driver
But I got an error here too:
user@ubuntu:~$ python
>>> import pycuda.driver as cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pycuda-2011.2.2-py2.7-linux-x86_64.egg/pycuda/driver.py", line 2, in <module>
from pycuda._driver import *
ImportError: libcurand.so.4: wrong ELF class: ELFCLASS32
I must be doing something wrong. Can you please elaborate?
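For what it's worth, the first traceback most likely comes from starting Python inside the pycuda source directory, so the unbuilt pycuda/ package there shadows the installed one; try it from any other directory. The ELFCLASS32 error usually means the dynamic loader found a 32-bit libcurand while the interpreter is 64-bit. A small sketch (not from this thread; the paths below are guesses, adjust them to your CUDA install) that reports the ELF class of a library file:

# Hedged sketch: report whether a shared library is 32- or 64-bit by reading
# the EI_CLASS byte of its ELF header.  The libcurand paths are assumptions.
def elf_class(path):
    with open(path, "rb") as f:
        header = f.read(5)
    if header[:4] != b"\x7fELF":
        raise ValueError("%s is not an ELF file" % path)
    return {1: 32, 2: 64}[ord(header[4:5])]

# A 64-bit Python needs the 64-bit copy (usually under .../lib64) visible
# on LD_LIBRARY_PATH, otherwise the import fails exactly as above.
print(elf_class("/usr/local/cuda/lib/libcurand.so.4"))    # probably 32
print(elf_class("/usr/local/cuda/lib64/libcurand.so.4"))  # probably 64

If the lib64 copy is the 64-bit one, pointing LD_LIBRARY_PATH at it (or adding it to /etc/ld.so.conf.d and running ldconfig) should let the import go through.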
Dear PyCuda community,
First of all I would like to introduce myself: I am a scientific
developer and I am pretty new to PyCuda (even though I followed a CUDA
course). I would like to port part of a very big application to the GPU,
switching from FFTW to scikit.cuda (the cu_fft part). This was
straightforward, thanks to the very good abstraction done in PyCuda. I
already got a 5x speed-up with exactly the same results as FFTW.
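For reference, a minimal sketch of the kind of port described, assuming the scikits.cuda fft interface; array names, sizes and dtypes are purely illustrative:

# Minimal sketch, assuming scikits.cuda's cu_fft interface; the array name,
# size and dtype are illustrative, not taken from the application above.
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from scikits.cuda.fft import Plan, fft

n = 1024 * 1024
data = np.random.rand(n).astype(np.float32)

data_gpu = gpuarray.to_gpu(data)                        # real-valued input
result_gpu = gpuarray.empty(n // 2 + 1, np.complex64)   # half-spectrum output

plan = Plan(data.shape, np.float32, np.complex64)       # build once, reuse per transform
fft(data_gpu, result_gpu, plan)                         # R2C FFT on the GPU

print(result_gpu.get()[:5])   # compare against np.fft.rfft(data)[:5]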
My problems start when integrating the code into Python threads: the
large application makes all its PyCuda calls from different threads, and
this ends with memory leaks on the GPU and a crash after a couple of
minutes. So I need to force all Python threads to use the same context
on the GPU.
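A minimal sketch of one way to do that, assuming pycuda.driver's explicit context push/pop API; the per-thread work is a hypothetical stand-in:

# Minimal sketch: one context, created in the main thread, shared by all
# worker threads via push()/pop().  The per-thread work is hypothetical.
import threading
import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

cuda.init()
ctx = cuda.Device(0).make_context()   # the single shared context
ctx.pop()                             # detach it from the main thread so workers can push it
lock = threading.Lock()               # serialize GPU work to keep the sketch simple

def worker(i):
    with lock:
        ctx.push()                    # make the shared context current in this thread
        try:
            a = gpuarray.to_gpu(np.random.rand(1000).astype(np.float32))
            print("thread %d: sum = %f" % (i, float(gpuarray.sum(a).get())))
            del a                     # free device memory while the context is still current
        finally:
            ctx.pop()                 # hand the context back

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

ctx.push()
ctx.detach()                          # final cleanup, mirroring pycuda.autoinit's atexit handler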
I have another question: why is data1_gpu.ptr changing, whereas data2_gpu.ptr and the plan stay fixed (as expected), in my code?
Thanks for your help.
Cheers
--
Jérôme Kieffer
Data analysis unit - ESRF
Bryan Catanzaro <bcatanzaro(a)acm.org> writes:
> I agree that data size matters in these discussions. But I think the
> right way to account for it is show performance at a range of data
> sizes, as measured from Python.
>
> The assumption that you'll keep the GPU busy isn't necessarily true.
> thrust::reduce, for example (which max_element uses internally),
> launches a big kernel, followed by a small kernel to finish the
> reduction tree, followed by a cudaMemcpy to transfer the result back
> to the host. The GPU won't be busy during the small kernel, nor
> during the cudaMemcpy, nor during the conversion back to Python, etc.
> Reduce is often used to make control flow decisions in optimization
> loops, where you don't know what the next optimization step to be
> performed is until the result is known, and so you can't launch the
> work speculatively. If the control flow is performed in Python, all
> these overheads are exposed to application performance - so I think
> they matter.
Glad you brought that up. :) The conjugate gradient solver in PyCUDA
addresses exactly that by simply running iterations as fast as it can
and shepherding the residual results to the host on their own time,
deferring convergence decisions until the data is available. That was
good for a 20% or so gain last time I measured it (on a GT200).
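Not the actual solver code, but a minimal sketch of that pattern, with a hypothetical do_cg_step() standing in for the real update kernels:

# Minimal sketch of the "shepherd the residual to the host on its own time"
# pattern, not the actual PyCUDA CG solver.  do_cg_step() is a stand-in.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

stream = cuda.Stream()
residual_host = cuda.pagelocked_empty(1, np.float32)   # pinned, so the copy can overlap
in_flight = None          # (source GPUArray, Event) of the residual copy currently in transit
tol, max_iter = 1e-6, 10000

x = gpuarray.zeros(1000, np.float32)

def do_cg_step(x):
    # Hypothetical stand-in: run the CG update kernels and return the squared
    # residual norm as a 0-d GPUArray, without synchronizing.
    return gpuarray.sum(x * x)

for it in range(max_iter):
    residual_gpu = do_cg_step(x)

    if in_flight is None:
        # Kick off the transfer and keep iterating; don't wait for it.
        cuda.memcpy_dtoh_async(residual_host, residual_gpu.gpudata, stream)
        evt = cuda.Event()
        evt.record(stream)
        in_flight = (residual_gpu, evt)
    elif in_flight[1].query():
        # A residual from an earlier iteration has arrived; decide only now.
        if residual_host[0] < tol:
            print("converged by iteration %d" % it)
            break
        in_flight = None   # start the next transfer on the following iteration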
Andreas
On Thu, 31 May 2012 12:56:15 +1200, Igor <rychphd(a)gmail.com> wrote:
> I've updated the http://dev.math.canterbury.ac.nz/home/pub/26/
>
> larger vector, a billion elements.
>
> As for returning the value, it's the pair of max value and position we
> are talking about, thrust returns the position and I'm now timing the
> extraction of the value from the gpu array which didn't change timing
> too much.
>
> ReductionKernel still appears 5 times slower than thrust.
>
> Bryan, on the same worksheet the numpy timing is printed as well:
> argmax is 3 times slower than ReductionKernel.
I've looked at this for a little while but can't quite make heads or tails
of it yet. Here's the profiler output:
method=[ reduce_kernel_stage1 ] gputime=[ 20617.984 ] cputime=[20647.000 ] gridsize=[ 128, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 1.000 ] l1_shared_bank_conflict=[ 672 ] inst_issued=[ 7906011 ]
method=[ reduce_kernel_stage2 ] gputime=[ 9.696 ] cputime=[ 29.000 ] gridsize=[ 1, 1 ] threadblocksize=[ 512, 1, 1 ] occupancy=[ 0.333 ] l1_shared_bank_conflict=[ 96 ]
method=[ _ZN6thrust<snip>] gputime=[ 3556.736 ] cputime=[ 3583.000 ] gridsize=[ 32, 1 ] threadblocksize=[ 768, 1, 1 ] occupancy=[ 1.000 ] l1_shared_bank_conflict=[ 1255 ] inst_issued=[ 2964333 ]
method=[ _ZN6thrust6<snip>] gputime=[ 8.640 ] cputime=[ 30.000 ] gridsize=[ 1, 1 ] threadblocksize=[ 32, 1, 1 ] occupancy=[ 0.021 ] l1_shared_bank_conflict=[ 18 ]
Second stages are comparable, but PyCUDA receives a sound beating in the
first stage. I don't quite understand why though. Code-wise, PyCUDA and
thrust do mostly the same thing--some parameters are different, but I've
twiddled them, and they don't make a big difference. From the profile,
the main killer seems to be that thrust's code simply issues three times
fewer instructions. But I don't get why--the codes aren't that
different.
Compare yourself:
https://code.google.com/p/thrust/source/browse/thrust/system/detail/generic…
https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/re…
https://code.google.com/p/thrust/source/browse/thrust/system/cuda/detail/bl…
vs
https://github.com/inducer/pycuda/blob/b28595eb92345f561096e833062f11b89601…
I've also made a version of the reduction that's even more directly like
what thrust does:
https://github.com/inducer/pycuda/blob/thrusty-reduce/pycuda/reduction.py
The timing is about the same, maybe even a tad slower. I'd much appreciate
any clues. Igor, can you please check whether the perf difference is the same
on just a simple sum'o'floats?
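Something along these lines, say (a plain ReductionKernel sum with CUDA-event timing; the vector size is arbitrary):

# Minimal sketch of the "simple sum o' floats" check asked for above,
# using ReductionKernel and CUDA events for device-side timing.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.reduction import ReductionKernel

n = 10**8
x = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))

sum_krnl = ReductionKernel(np.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]",
        arguments="const float *x")

sum_krnl(x)   # warm-up, triggers compilation

start, end = cuda.Event(), cuda.Event()
start.record()
result = sum_krnl(x)
end.record()
end.synchronize()
print("sum = %f, gpu time: %.3f ms" % (float(result.get()), start.time_till(end)))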
Andreas
Hi Bryan,
http://dev.math.canterbury.ac.nz/home/pub/26/ now has the timing
measured with Python's time.time() -- there isn't much difference. The
card is a Tesla C2070.
Igor
On Thu, May 31, 2012 at 3:31 PM, Bryan Catanzaro <bcatanzaro(a)acm.org> wrote:
> Hi Igor -
> I meant that it's more useful to know the execution time of code
> running on the GPU from Python's perspective, since Python is the one
> driving the work, and the execution overheads can be significant.
> What timings do you get when you use timeit rather than CUDA events?
> Also, what GPU are you running on?
>
> - bryan
>
> On Wed, May 30, 2012 at 5:56 PM, Igor <rychphd(a)gmail.com> wrote:
>> I've updated the http://dev.math.canterbury.ac.nz/home/pub/26/
>>
>> larger vector, a billion elements.
>>
>> As for returning the value, it's the pair of max value and position we
>> are talking about, thrust returns the position and I'm now timing the
>> extraction of the value from the gpu array which didn't change timing
>> too much.
>>
>> ReductionKernel still appears 5 times slower than thrust.
>>
>> Bryan, on the same worksheet the numpy timing is printed as well:
>> argmax is 3 times slower than ReductionKernel.
>>
>>
>>
>>
>> On Thu, May 31, 2012 at 12:08 PM, Andreas Kloeckner
>> <lists(a)informa.tiker.net> wrote:
>>> On Wed, 30 May 2012 22:13:27 +1200, Igor <rychphd(a)gmail.com> wrote:
>>>> Hi Andreas,
>>>> I'm attaching an example for your wiki demonstrating how to find a max
>>>> element position both using ReductionKernel and thrust-nvcc-ctypes.
>>>> The latter doesn't quite work on windows yet. Should work if you're on
>>>> a linux, just change the FOLDER. There is a live version published on
>>>> my sage server (http://dev.math.canterbury.ac.nz/home/pub/26/ ) --
>>>> there all work and show a discouraging 5-fold slowdown of
>>>> ReductionKernel as compared to thrust (run twice, as the .so file is
>>>> loaded lazily?). Could you take a look and edit it if necessary?
>>>
>>> Not a fair comparison. The PyCUDA test includes the transfer of the
>>> result to the host. (.get()) Doesn't look like that's the case for
>>> thrust. Also, an 80 MB vector is tiny. At 200 GB/s, that's about 4e-4s,
>>> which is in the vicinity of launch overhead.
>>>
>>> Andreas
>>
On Wed, 30 May 2012 21:58:13 -0700, Bryan Catanzaro <bcatanzaro(a)acm.org> wrote:
> Why should the overhead be measured separately? For users of these
> systems, the Python overhead is unavoidable. The time spent running
> on the GPU alone is an important implementation detail for people
> improving systems like PyCUDA, but users of these systems see overhead
> costs exposed in their overall application performance, and so I don't
> see how the overhead can be ignored.
Because whether the overhead matters or not depends on data size. Since
the overhead is constant across all data sizes, that overhead is going
to be mostly irrelevant for big data, whereas for tiny data it might
well be a dealbreaker.
That's why I think a single number doesn't cut it.
In addition, there's an underlying assumption that you'll keep the GPU
busy for a while, i.e. keep the GPU queue saturated. If you do that (the
ability to do that being related, again, to data size), then on top of
that anything Python does runs in parallel to the GPU--and your net run
time will be exactly the same as if the overhead never happened.
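For concreteness, a sketch of collecting both numbers at once, the GPU-only time via CUDA events and the end-to-end time as seen from Python via time.time(), across a range of sizes; gpuarray.sum is just a stand-in for whatever is actually being benchmarked:

# Sketch: device-side time (CUDA events) vs. end-to-end time (wall clock)
# for a range of vector sizes, to see where the constant Python overhead
# stops mattering.  gpuarray.sum is a stand-in for the real workload.
import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

for n in [10**4, 10**5, 10**6, 10**7, 10**8]:
    x = gpuarray.to_gpu(np.random.rand(n).astype(np.float32))
    gpuarray.sum(x).get()   # warm-up

    start_evt, end_evt = cuda.Event(), cuda.Event()
    t0 = time.time()
    start_evt.record()
    s = gpuarray.sum(x)
    end_evt.record()
    result = s.get()        # includes the copy back and all Python overhead
    t1 = time.time()
    end_evt.synchronize()

    print("n=%9d  gpu %8.3f ms  wall %8.3f ms" % (
        n, start_evt.time_till(end_evt), (t1 - t0) * 1e3))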
Andreas
Why should the overhead be measured separately? For users of these
systems, the Python overhead is unavoidable. The time spent running
on the GPU alone is an important implementation detail for people
improving systems like PyCUDA, but users of these systems see overhead
costs exposed in their overall application performance, and so I don't
see how the overhead can be ignored.
- bryan
On Wed, May 30, 2012 at 9:47 PM, Andreas Kloeckner
<kloeckner(a)cims.nyu.edu> wrote:
> On Wed, 30 May 2012 20:31:40 -0700, Bryan Catanzaro <bcatanzaro(a)acm.org> wrote:
>> Hi Igor -
>> I meant that it's more useful to know the execution time of code
>> running on the GPU from Python's perspective, since Python is the one
>> driving the work, and the execution overheads can be significant.
>> What timings do you get when you use timeit rather than CUDA events?
>> Also, what GPU are you running on?
>
> timeit isn't really the right way to measure this, I think. There's some
> amount of Python overhead, of course, and it should be measured
> separately (and of course reduced, if possible). Once that's done, see
> how long the GPU works on its part of the job for a few vector sizes,
> and then figure out the vector size above which the Python time is as
> long as the GPU time and see where that sits compared to your typical
> data size.
>
> That would be more useful, IMO.
>
> Andreas
>
> --
> Andreas Kloeckner
> Room 1105A (Warren Weaver Hall), Courant Institute, NYU
> http://www.cims.nyu.edu/~kloeckner/
> +1-401-648-0599
Hi Igor -
I meant that it's more useful to know the execution time of code
running on the GPU from Python's perspective, since Python is the one
driving the work, and the execution overheads can be significant.
What timings do you get when you use timeit rather than CUDA events?
Also, what GPU are you running on?
- bryan
On Wed, May 30, 2012 at 5:56 PM, Igor <rychphd(a)gmail.com> wrote:
> I've updated the http://dev.math.canterbury.ac.nz/home/pub/26/
>
> larger vector, a billion elements.
>
> As for returning the value, it's the pair of max value and position we
> are talking about, thrust returns the position and I'm now timing the
> extraction of the value from the gpu array which didn't change timing
> too much.
>
> ReductionKernel still appears 5 times slower than thrust.
>
> Bryan, on the same worksheet the numpy timing is printed as well:
> argmax is 3 times slower than ReductionKernel.
>
>
>
>
> On Thu, May 31, 2012 at 12:08 PM, Andreas Kloeckner
> <lists(a)informa.tiker.net> wrote:
>> On Wed, 30 May 2012 22:13:27 +1200, Igor <rychphd(a)gmail.com> wrote:
>>> Hi Andreas,
>>> I'm attaching an example for your wiki demonstrating how to find a max
>>> element position both using ReductionKernel and thrust-nvcc-ctypes.
>>> The latter doesn't quite work on windows yet. Should work if you're on
>>> a linux, just change the FOLDER. There is a live version published on
>>> my sage server (http://dev.math.canterbury.ac.nz/home/pub/26/ ) --
>>> there all work and show a discouraging 5-fold slowdown of
>>> ReductionKernel as compared to thrust (run twice, as the .so file is
>>> loaded lazily?). Could you take a look and edit it if necessary?
>>
>> Not a fair comparison. The PyCUDA test includes the transfer of the
>> result to the host. (.get()) Doesn't look like that's the case for
>> thrust. Also, an 80 MB vector is tiny. At 200 GB/s, that's about 4e-4s,
>> which is in the vicinity of launch overhead.
>>
>> Andreas
>