On Fri, May 6, 2011 at 4:34 AM, Patric Holmvall <patric.hol(a)gmail.com> wrote:
Thank you for the reply Andreas.
I don't think multiple kernels would be an option in this case, as the
entire kernel is essentially one serial loop. Splitting it into multiple
kernels would just mean breaking this loop into many successive launches
(say, in chunks growing by powers of 10). Maybe it helps if I explain what
we're trying to do.
The goal of our project is to implement Monte Carlo/Metropolis computations
on path integrals/fields efficiently on the GPU. We have a huge number of
"paths" that we want to randomly generate and modify. This is a serial
process, since each modification depends on all the previous ones. What
makes the GPU attractive is that we have many dimensions to parallelize
over: each path needs to be modified very many times, and for each
modification we need to calculate the energy and an operator over the path.
This serial computation is the major time contributor and is what leads us
to want calculations running longer than 5 seconds on each thread. Ideally,
we would like to keep modifying these paths for minutes or even hours.
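To make the structure concrete, here is a rough sketch of what one chunked
launch could look like. The kernel name, the toy LCG random numbers, and the
proposal/accept step are placeholders, not our actual code; the point is
only that the path and RNG state live in global memory, so they survive
between launches:

kernel_src = """
__kernel void metropolis_chunk(__global float *paths,    /* flattened paths */
                               __global uint *rng_state, /* one RNG per path */
                               int path_len,
                               int n_updates)         /* updates this launch */
{
    int p = get_global_id(0);              /* one work-item per path */
    __global float *path = paths + p * path_len;
    uint s = rng_state[p];

    for (int i = 0; i < n_updates; ++i) {
        s = s * 1664525u + 1013904223u;    /* toy LCG, stand-in for a real RNG */
        int site = s % path_len;
        float trial = path[site] + ((s >> 16) / 65536.0f - 0.5f);
        /* the Metropolis accept/reject on the energy difference would go
           here; this sketch just accepts unconditionally */
        path[site] = trial;
    }
    rng_state[p] = s;                      /* persist state for the next launch */
}
"""

Each launch does n_updates modifications and stops, so no single kernel call
has to run anywhere near the watchdog limit.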
Actually, this is a major effort in molecular dynamics too. We want to run
millions of simulations with different initial conditions to catch
interesting events in action, or to get the statistical distribution of the
quantities we measure.
A CPU-side scheduler would be nice for queueing up kernels beyond 5 seconds
on platforms where that is a problem, but you can just write that in Python.
The trick is to run a reusable GPU kernel that saves the data it needs for
the next call. In molecular dynamics this is easy, because you are already
calculating the positions and velocities of the molecules for the next
iteration.
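As a concrete illustration of that trick, here is a minimal PyOpenCL sketch,
assuming the placeholder kernel from earlier in the thread; all sizes are
illustrative. The device buffers stay alive between launches, so each short
kernel call picks up exactly where the previous one left off:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n_paths, path_len = 2**12, 256
mf = cl.mem_flags
# Persistent device state: the paths themselves and per-path RNG seeds.
paths = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR,
                  hostbuf=np.zeros(n_paths * path_len, dtype=np.float32))
rng = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR,
                hostbuf=np.arange(1, n_paths + 1, dtype=np.uint32))

prg = cl.Program(ctx, kernel_src).build()  # kernel_src from the sketch above

total_updates, chunk = 10**6, 10**4        # keep each launch well under 5 s
for done in range(0, total_updates, chunk):
    prg.metropolis_chunk(queue, (n_paths,), None,
                         paths, rng, np.int32(path_len), np.int32(chunk))
    queue.finish()                   # wait before enqueueing the next chunk

The only per-chunk cost is the launch overhead, which is microseconds, so
even 10^6 updates split into 100 chunks loses essentially nothing.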
The problem arises if something else on your OS uses the GPU before your
next call and wipes the data you have left in global memory. As more
applications use the GPU, I see this becoming a major problem. For your
multithreaded app, I would pass a token between threads that gives each one
full access to the GPU until it is done, or have one thread be in charge of
scheduling all the GPU calls.
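In Python, the token can simply be a lock. A minimal sketch, where
launch_chunk() is a hypothetical stand-in for the enqueue-and-finish step
shown above:

import threading

gpu_token = threading.Lock()

def worker(args):
    # Hold the "token" from enqueue until the GPU work completes, so no
    # other thread can touch the device (or its buffers) in between.
    with gpu_token:
        launch_chunk(args)  # hypothetical: enqueue kernel, then queue.finish()

The single-scheduler-thread variant is the same idea with the lock replaced
by a work queue that only one thread drains.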
Just as a general guideline, we are talking about these kinds of numbers:
Threads: 2^12 or higher (these represent the paths)
Modifications per path: 10^6 or more (this is the limit before the timeout
error kicks in; we want to go higher)
So, is there an efficient way to use the GPU as a computation device for
this kind of computation for several minutes at a time?
On Thu, May 5, 2011 at 8:03 PM, Andreas Kloeckner
<lists(a)informa.tiker.net> wrote:
> On Thu, 5 May 2011 19:43:46 +0200, Patric Holmvall
> <patric.hol(a)gmail.com> wrote:
> > Hi again,
> > Andreas:
> > I understand that this restriction is because we are working with the
> > GPU: you wouldn't want to occupy it with heavy calculations for more
> > than a few seconds, because that would leave you unable to use the
> > computer in the meantime.
> > So how would you go about it if you want to do calculations longer than
> > 5 seconds? Is there any efficient way to re-run the kernel, etc.?
> Kernel launches have an overhead measured in microseconds. Breaking up
> your work into multiple kernels is good practice.
> > Tomasz:
> > My bad, I didn't do the installation on that system and was misinformed.
> > The package was in fact from PyPI.
> > By the way, any clue when PyOpenCL AMD/ATI support will be available for
> > Debian? The ATI Stream SDK is currently available for Ubuntu, so
> > presumably it will be soon?
> AMD Stream works fine on Debian; I'm using it all the time (just not with
> libc 2.13, but 2.11 works).