Hi blahblahblah,
I had the same error and tried your way.
I did this:
user@ubuntu:~/pycuda-2011.2.2$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
>>> import pycuda.driver as cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pycuda/driver.py", line 2, in <module>
from pycuda._driver import *
ImportError: No module named _driver
But I got an error here too:
user@ubuntu:~$ python
>>> import pycuda.driver as cuda
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File
"/usr/local/lib/python2.7/dist-packages/pycuda-2011.2.2-py2.7-linux-x86_64.egg/pycuda/driver.py",
line 2, in <module>
from pycuda._driver import *
ImportError: libcurand.so.4: wrong ELF class: ELFCLASS32
I am doing something wrong. Can you please elaborate?
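In case it helps narrow it down, here is a quick way to check which libcurand the dynamic loader resolves; my reading that ELFCLASS32 means a 32-bit library being picked up by a 64-bit Python, and the lib64 path mentioned in the comment, are assumptions on my part:

import ctypes

# On a 64-bit Python this should resolve to the 64-bit CUDA libraries
# (typically under /usr/local/cuda/lib64). If a 32-bit libcurand is found
# first on the library search path, loading it fails with the same
# "wrong ELF class: ELFCLASS32" error shown above.
ctypes.CDLL("libcurand.so.4")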
Hi,
I ran into a problem when explicitly passing shared_size=None to a kernel's prepared_call.
Example code is attached.
Error message:
ArgumentError: Python argument types in
Function._launch_kernel(Function, tuple, tuple, str, NoneType, NoneType)
did not match C++ signature:
_launch_kernel(pycuda::function {lvalue}, pycudaboost::python::tuple, pycudaboost::python::tuple, pycudaboost::python::api::object, unsigned int, pycudaboost::python::api::object)
I'm using 2012.1.
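A minimal sketch of a workaround, using a made-up add_one kernel just to illustrate the call: pass an integer (or omit the keyword entirely) instead of None, since shared_size maps to the unsigned int parameter in the C++ signature above.

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Hypothetical kernel, used only to show the prepared_call signature.
mod = SourceModule("""
__global__ void add_one(float *a) { a[threadIdx.x] += 1.0f; }
""")
add_one = mod.get_function("add_one")
add_one.prepare("P")

a_gpu = cuda.mem_alloc(32 * np.dtype(np.float32).itemsize)

# shared_size must be an int, not None; 0 requests no dynamic shared memory.
add_one.prepared_call((1, 1), (32, 1, 1), a_gpu, shared_size=0)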
Best,
Yiyin
I think the solution was to do something like:
/dev/nvidia > /dev/null
but I'm not sure.
These seem relevant:
http://reference.wolfram.com/mathematica/CUDALink/tutorial/Headless.html
http://blog.njoubert.com/2010/10/running-cuda-without-running-x.html
They also suggest running nvidia-smi in persistent mode, which could either be the solution to your problem or what you have already been doing so far; a rough sketch from Python follows below.
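Something along these lines, assuming nvidia-smi is on the PATH, the driver supports the -pm flag, and the script runs with root privileges:

import subprocess

# Keep the NVIDIA driver initialized on a headless box so every CUDA
# program does not pay the driver start-up cost itself.
subprocess.check_call(["nvidia-smi", "-pm", "1"])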
Apostolis
Ok, I think you found the source of my problem, Apostolis.
I profiled the execution both on the server and on the laptop: the memcpy calls with non-pinned memory were considerably faster on the server's Tesla than on the laptop's GT 540M, and the pinned-memory transfers took about the same time on both.
From your previous email, I decided to take a look at the (py)cuda initialization. So I removed the pycuda.autoinit import and did the initialization "by hand" to get some crude time measurements. I added the following lines at the start of the benchmark() function:
print "Starting initialization"
cuda.init()
dev = cuda.Device(0)
ctx = dev.make_context()
print "Initialization finished"
So I ran this modified code. On the laptop it executed pretty fast, with less than a second elapsed between the two prints. But when I ran it on the server, there were about 10 seconds between the first print and the last one.
After receiving your last e-mail I ran nvidia-smi first and then the program, with no changes. But then I tried leaving nvidia-smi looping with the -l argument and running the program on another tty, and to my surprise it ran in a little less than 2 seconds, against those nearly 15 when nvidia-smi isn't looping.
This is still slower than the laptop, but this particular code is not optimized for multi-GPU, and there could be other factors like the communication latency over the PCI bus (which, I was told on this list, is sometimes lower on laptops) and the fact that I am executing remotely via ssh.
As for what you told me about mounting /dev/nvidia, I had to do that earlier: since I didn't install a GUI, the /dev/nvidia* device files weren't created at boot time, so CUDA programs would not detect the devices (I hit this right after finishing the CUDA installation, when running the deviceQuery example from the SDK gave me a "no CUDA-capable devices found" error).
Any further ideas on why running nvidia-smi at the same time speeds up initialization so much? You've been really helpful and I really appreciate it, even if you cannot help me any more (I'll just have to wait for those damned NVIDIA forums to come back :) )
On Tue, Jul 31, 2012 at 1:52 PM, Apostolis Glenis <apostglen46(a)gmail.com> wrote:
> I think it is the same case.
> The NVIDIA driver is initialized when X-windows starts or at the first
> execution of a GPU program.
> Could you try nvidia-smi first and then your program?
> I have read somewhere (I think in the thrust-users mailing list) that you
> have to load /dev/nvidia first or something like that.
> The closest thing I could find was this:
> http://www.gpugrid.net/forum_thread.php?id=266
Apostolis, I'm not using X windows, as I did not install any GUI on the server.
On Tue, Jul 31, 2012 at 11:46 AM, Apostolis Glenis
<apostglen46(a)gmail.com> wrote:
> Maybe it has to do with the initialization of the GPU, if another GPU is
> responsible for X windows.
Just to add a concrete and simple example that I guess will clarify my situation. The following code creates two buffers on the host side, one page-locked and the other a common one, then copies them to a GPU buffer and evaluates performance using events for timing.
It's really simple; there's no execution on multiple GPUs, but I would expect it to run in more or less the same time on the server using just one of the Teslas.
However, it takes less than a second to run on my laptop and nearly 15 seconds on the server!
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np

def benchmark(up):
    """up is a boolean flag. If True, the benchmark copies from host to
    device; if False, it copies the other way round.
    """
    # Buffer size in bytes
    size = 10*1024*1024

    # Host and device buffers, equally shaped. We don't care about their contents.
    cpu_buff = np.empty(size, np.dtype('u1'))
    cpu_locked_buff = cuda.pagelocked_empty(size, np.dtype('u1'))
    gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)

    # Events for measuring execution time; the first two are for the
    # non-pinned buffer, the last two for the pinned (page-locked) buffer.
    startn = cuda.Event()
    endn = cuda.Event()
    startl = cuda.Event()
    endl = cuda.Event()

    if up:
        startn.record()
        cuda.memcpy_htod(gpu_buff, cpu_buff)
        endn.record()
        endn.synchronize()
        t1 = endn.time_since(startn)

        startl.record()
        cuda.memcpy_htod(gpu_buff, cpu_locked_buff)
        endl.record()
        endl.synchronize()
        t2 = endl.time_since(startl)

        print "From host to device benchmark results:\n"
        print "Time for copying from normal host mem: %i ms\n" % t1
        print "Time for copying from pinned host mem: %i ms\n" % t2

        diff = t1 - t2
        if diff > 0:
            print "Copy from pinned memory was %i ms faster\n" % diff
        else:
            print "Copy from pinned memory was %i ms slower\n" % -diff
    else:
        startn.record()
        cuda.memcpy_dtoh(cpu_buff, gpu_buff)
        endn.record()
        endn.synchronize()
        t1 = endn.time_since(startn)

        startl.record()
        cuda.memcpy_dtoh(cpu_locked_buff, gpu_buff)
        endl.record()
        endl.synchronize()
        t2 = endl.time_since(startl)

        print "From device to host benchmark results:\n"
        print "Time for copying to normal host mem: %i ms\n" % t1
        print "Time for copying to pinned host mem: %i ms\n" % t2

        diff = t1 - t2
        if diff > 0:
            print "Copy to pinned memory was %i ms faster\n" % diff
        else:
            print "Copy to pinned memory was %i ms slower\n" % -diff

benchmark(up=False)
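If it helps when comparing the two machines, the same event timings can also be reported as an effective bandwidth. A small helper along these lines (my own addition, not part of the script above):

def print_bandwidth(label, nbytes, msecs):
    # Convert an event timing in milliseconds into an effective MB/s figure.
    print "%s: %.1f MB/s" % (label, nbytes / (msecs * 1e-3) / (1024.0 * 1024.0))

# For example, inside benchmark(), right after t1 and t2 are computed:
# print_bandwidth("Pageable copy", cpu_buff.nbytes, t1)
# print_bandwidth("Pinned copy", cpu_locked_buff.nbytes, t2)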
---------- Forwarded message ----------
From: Leandro Demarco Vedelago <leandrodemarco(a)gmail.com>
Date: Mon, Jul 30, 2012 at 2:57 PM
Subject: Re: [PyCUDA] Performance Issues
To: Brendan Wood <wood(a)synchroverge.com>, lists(a)informa.tiker.ne
Brendan:
Basically, all the examples compute the dot product of two large vectors, but each example introduces some new concept (pinned memory, streams, etc.).
The last example is the one that uses multiple GPUs.
As for the work done, I am generating the data randomly and doing some checks on the host side at the end, which considerably increases execution time, but as these are "learning examples" I was not especially worried about it. I would have expected, though, that given the server's far more powerful hardware (the three Tesla C2075s and four Intel Xeons with 6 cores each and 48 GB of RAM) the programs would run faster, in particular this last example, which is designed to work with multiple GPUs.
I compiled and ran the bandwidth test and the deviceQuery samples from the SDK and they both passed, if that is what you meant.
Now, answering Andreas:
Yes, I'm using one thread per GPU (the way it's done in the wiki example), and yes, the server has far more than 3 CPUs. As for the SCHED_BLOCKING_SYNC flag, should I pass it as an argument for each device context? What does this flag do?
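For context, here is a rough sketch of that one-thread-per-GPU pattern (loosely modeled on the wiki example; the GPUWorker class and the way the data is split are my own illustration):

import threading
import numpy as np
import pycuda.driver as cuda

class GPUWorker(threading.Thread):
    def __init__(self, device_id, host_chunk):
        threading.Thread.__init__(self)
        self.device_id = device_id
        self.host_chunk = host_chunk

    def run(self):
        dev = cuda.Device(self.device_id)
        ctx = dev.make_context()   # each thread owns its own context
        try:
            gpu_buf = cuda.mem_alloc(self.host_chunk.nbytes)
            cuda.memcpy_htod(gpu_buf, self.host_chunk)
            # ... launch this device's share of the dot product here ...
        finally:
            ctx.pop()              # release the context before the thread exits

cuda.init()
ngpus = cuda.Device.count()
data = np.random.rand(ngpus * 1024 * 1024).astype(np.float32)
chunks = np.array_split(data, ngpus)
workers = [GPUWorker(i, chunks[i]) for i in range(ngpus)]
for w in workers:
    w.start()
for w in workers:
    w.join()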
Thank you both for your answers
Leandro Demarco Vedelago <leandrodemarco(a)gmail.com> writes:
> Hello: I've been reading and learning CUDA in the last few weeks and
> last week I started writing (translating to Pycuda from Cuda-C) some
> examples taken from the book "Cuda by Example".
> I started coding on a laptop with just one nvidia GPU (a gtx 560M if
> my memory is allright) with Windows 7.
>
> But in the project I'm currently working at, we intend to run (py)cuda
> on a multi-gpu server that has three Tesla C2075 cards.
>
> So I installed Ubuntu server 10.10 (with no GUI) and managed to
> install and get running the very same examples I ran on the single-gpu
> laptop. However they run really slow, in some cases it takes 3 times
> more than in the laptop. And this happens with most, if not all, the
> examples I wrote.
How do you control the multiple GPUs? By threading? How many CPU cores do
you have in the machine? (Should be >= 3.)
Also try and switch away from the busy-wait sync:
http://documen.tician.de/pycuda/driver.html#pycuda.driver.ctx_flags.SCHED_B…
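A minimal sketch of passing that flag when creating the context by hand, one per GPU thread (the flag name comes from the link above; the exact call is my assumption):

import pycuda.driver as cuda

cuda.init()
dev = cuda.Device(0)
# SCHED_BLOCKING_SYNC makes the host thread block on a sync primitive
# instead of spinning, leaving CPU cores free for the other GPU threads.
ctx = dev.make_context(flags=cuda.ctx_flags.SCHED_BLOCKING_SYNC)
try:
    pass  # kernel launches and copies for this device go here
finally:
    ctx.pop()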
HTH,
Andreas
Hi Leandro,
Without knowing exactly what examples you're running, it may be hard to
say what the problem is. In fact, you may not really have a problem.
How much work is being done in each example program? Is it enough to
really work the GPU, or is communication and other overhead dominating
runtime? Note that laptops may have lower communication latency over
the PCI bus than desktops/servers, which can make small programs run
much faster on laptops regardless of how much processing power the GPU
has.
Have you tried running the sample code from the SDK, so that you can
verify that it's not a code problem?
Regards,
Brendan Wood
Hello: I've been reading and learning CUDA over the last few weeks, and last week I started writing (translating from CUDA C to PyCUDA) some examples taken from the book "CUDA by Example".
I started coding on a laptop with just one NVIDIA GPU (a GTX 560M, if my memory is right) running Windows 7.
But in the project I'm currently working on, we intend to run (Py)CUDA on a multi-GPU server that has three Tesla C2075 cards.
So I installed Ubuntu Server 10.10 (with no GUI) and managed to install and run the very same examples I ran on the single-GPU laptop. However, they run really slowly; in some cases they take three times longer than on the laptop. And this happens with most, if not all, of the examples I wrote.
I thought it could be a driver issue, but I double-checked and I've installed the correct ones, meaning those listed in the CUDA Zone section of nvidia.com for 64-bit Linux. So I'm kind of lost right now and was wondering if anyone has had this or a somewhat similar problem running on a server.
Sorry for the English, it's not my native language.
Thanks in advance, Leandro Demarco