Hi,
I was wondering if anyone had tested combining multiple cards of
different types (AMD and nVidia), and whether it can lead to any sort of
conflict?
Right now I have an nVidia 9800GT for display and a 295GTX for GPU
computation, and I'm thinking about adding an AMD HD 7970.
I want to use them for distributed computation (multiple threads
computing different parts of a data set), so I was wondering whether
there would be any conflict:
- for pyopencl to compile kernels in parallel for two platforms (probably
not; I think I already tested parallel computing with CPU+GPU)
- for the AMD and nVidia drivers to play along nicely...
Any advice?
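To make it concrete, the kind of setup I have in mind (untested sketch):

import pyopencl as cl

# One context (and one queue) per platform, so the AMD and nVidia
# devices can be driven from separate host threads.
contexts = []
for platform in cl.get_platforms():
    try:
        gpus = platform.get_devices(device_type=cl.device_type.GPU)
    except cl.Error:
        continue  # this platform exposes no GPU devices
    if gpus:
        contexts.append(cl.Context(gpus))

queues = [cl.CommandQueue(ctx) for ctx in contexts]
# Each host thread would then build and enqueue kernels on its own queue.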
--
Vincent Favre-Nicolin http://inac.cea.fr
CEA/Grenoble Institut Nanosciences & Cryogénie
Laboratoire SP2M/Nano-structures et Rayonnement Synchrotron
17, rue des Martyrs
38054 Grenoble Cedex 9 - France
Université Joseph Fourier http://www.ujf-grenoble.fr
tel: (+33) 4 38 78 95 40 fax: (+33) 4 38 78 51 38
I'd like to make a Python wrapper for the AMD OpenCL FFT library, here:
http://developer.amd.com/libraries/appmathlibs/Pages/default.aspx
I'd like to make it so that one can use the AMD code with PyOpenCL, since
Python makes using, well, everything much easier.
The problem with the AMD code, though, is that the intended usage is for
you to do a bunch of setup work, then issue one function call that may
actually enqueue several staged kernel operations. And they adamantly
refuse to alter their interface so that it could bake the FFT plans and
just hand you back one or more kernels, with a documented interface, for
you to enqueue on your own.
Here is how the user is supposed to use their library:
clAmdFftSetup( ... )
clAmdFftCreateDefaultPlan( ... )
clAmdFftSetPlanPrecision ( ... )
clAmdFftSetResultLocation( ... )
clAmdFftSetLayout( ... )
clAmdFftSetPlanBatchSize( ... )
clAmdFftSetPlanInStride( ... )
clAmdFftSetPlanOutStride( ... )
clAmdFftSetPlanDistance( ... )
clAmdFftSetPlanScale( ... )
clAmdFftBakePlan( ... )
Once the plan is baked, it can be used multiple times, with this
call:
while loop:
clAmdFftEnqueueTransform( ... )
And after it is run, you are free to enqueue your own kernel(s) and do
whatever you want with the results. For my purposes, I'd like to enqueue
a kernel that converts some of the real/imag data into the more useful
mag/phase information (as an aside, I have no idea why these people don't
already make mag/phase an option...it's weird not to have that, right?)
And I'd also like to do GL context sharing at this point.
enqueue_my_stuff( ... )
acquire GL buffers( ... )
do GL stuff ( ... )
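For the mag/phase step, what I have in mind is roughly this (untested
sketch; a random array stands in for the FFT output, and all names are
made up):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

src = """
__kernel void mag_phase(__global const float2 *cplx,
                        __global float *mag,
                        __global float *phase)
{
    int gid = get_global_id(0);
    float2 c = cplx[gid];
    mag[gid] = hypot(c.x, c.y);    /* magnitude */
    phase[gid] = atan2(c.y, c.x);  /* phase, in radians */
}
"""

n = 1024
data = np.random.rand(n).astype(np.complex64)  # stand-in for FFT output
cplx = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=data)
mag = cl.Buffer(ctx, mf.WRITE_ONLY, size=4 * n)
phase = cl.Buffer(ctx, mf.WRITE_ONLY, size=4 * n)

prg = cl.Program(ctx, src).build()
prg.mag_phase(queue, (n,), None, cplx, mag, phase)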
Then once you are done, you call their finish function:
clFinish( ... )
...
clAmdFftDestroyPlan( ... )
clAmdFftTeardown()
Questions for the group: I've never written my own Python extension module.
Would I need to use Boost.Python for this to work with PyOpenCL? Is it
possible to get this working, or is it a lost cause?
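For what it's worth, my current thinking is that no extension module may be
needed at all: ctypes plus the .int_ptr attribute that PyOpenCL objects
expose might be enough. A rough, untested sketch; the library name, enum
values, and signatures below are from memory, so treat them as guesses to
check against clAmdFft.h:

import ctypes
import numpy as np
import pyopencl as cl

# Guess at the runtime library name; adjust for your platform.
fft = ctypes.CDLL("libclAmdFft.Runtime.so")

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 1024
host = (np.random.rand(n) + 1j * np.random.rand(n)).astype(np.complex64)
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR,
                hostbuf=host)

# clAmdFftSetup(...) omitted here: it takes a clAmdFftSetupData struct
# that clAmdFftInitSetupData fills in; the same byref() pattern applies.

plan = ctypes.c_size_t()             # clAmdFftPlanHandle is opaque
lengths = (ctypes.c_size_t * 1)(n)
# .int_ptr hands the raw cl_context to C; assuming CLFFT_1D == 1.
fft.clAmdFftCreateDefaultPlan(ctypes.byref(plan),
                              ctypes.c_void_p(ctx.int_ptr),
                              1, lengths)

queues = (ctypes.c_void_p * 1)(queue.int_ptr)
fft.clAmdFftBakePlan(plan, 1, queues, None, None)

# In-place forward transform, assuming CLFFT_FORWARD == -1.
bufs = (ctypes.c_void_p * 1)(buf.int_ptr)
fft.clAmdFftEnqueueTransform(plan, -1, 1, queues,
                             0, None, None, bufs, None, None)
queue.finish()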
--Keith Brafford
Hello Guilherme,
you should initialize the device memory only once, with the maximum
number of entries. Then use cl.enqueue_copy to copy only the parts that
you actually have to move between device and host. Copying should be
faster than re-creating the buffers every call.
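Something along these lines (untested sketch; n_act plays the role of
your N_L, and the kernel call is only indicated in a comment):

import numpy as np
import pyopencl as cl

mf = cl.mem_flags
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n_max = 1 << 14                           # total number of particles N
pos = np.zeros(n_max, dtype=np.float32)   # host-side data
acc = np.empty_like(pos)

# Allocate once, at the maximum size, and keep the buffers around.
pos_buf = cl.Buffer(ctx, mf.READ_ONLY, size=pos.nbytes)
acc_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=acc.nbytes)

n_act = n_max // 4                        # the N_L active particles

# Per step: copy only the active slice instead of re-creating buffers.
cl.enqueue_copy(queue, pos_buf, pos[:n_act])

# Launch over n_act work items only, passing n_act to the kernel:
# acc_kernel(queue, (n_act,), None, pos_buf, acc_buf, np.uint32(n_act))

# Read back only the part that was computed; the result is O(N_L).
cl.enqueue_copy(queue, acc[:n_act], acc_buf)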
Best regards,
Dieter
On 18.05.2012 at 22:33, Guilherme Gonçalves Ferrari wrote:
> Dear all,
> I am working on a O(N^2) N-body code and I decided to write everything
> in python/numpy and to use pyopencl only to accelerate the
> gravitational force calculation.
>
> In a naive approach for time integration of equations of motion one
> could simply use a constant time-step for all particles. Therefore the
> CL kernel for force calculation is always called with the same (total)
> number of particles, namely N. This works fine since the calculation
> cost is [for reasonable values of N (>10^3)] much larger than the
> communication cost. However, if one wants to use a smarter approach to
> the time integration of the system one needs to use a sort of
> individual time-step scheme (in my case, a power-of-2 block time-step
> scheme). In this case I need to call the CL kernel with a different
> number of particles N_L (N_L <= N) whenever a subset of particles
> in a given time-step level L needs to be integrated.
>
> So, after making sure that everything works properly, I am now trying to
> fix some bottlenecks of my implementation. In the function for force
> calculation I do something like this:
>
> ...
> acc_kernel = kernel_library.get_kernel("p2p_acc_kernel")
> acc_kernel.set_kernel_args(*data, global_size=global_size,
>                            local_size=local_size,
>                            result_shape=result_shape,
>                            local_memory_shape=local_memory_shape)
> acc_kernel.run()
> result = acc_kernel.get_result()
> ...
>
> where 'data' is a tuple of numpy arrays.
> After some profiling I have figured out that the main bottleneck is in
> the 'set_kernel_args' method, where I have two calls to
> cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=hostbuf)
> and one call to
> cl.Buffer(ctx, mf.WRITE_ONLY, size=kernel_result.nbytes)
>
> I know that calling cl.Buffer every time I need to call the CL kernel
> is quite inefficient, but that was the only way I found to get the
> right result.
>
> So, my questions are:
>
> i) How can I avoid this problem without completely refactoring my
> implementation (I don't want to lose too much of the flexibility of
> my code)?
>
> ii) Is it possible to allocate the maximum size [which is O(N)] of the
> device memory at the beginning of the simulation and still be able to
> call the CL kernel with different numbers of particles N_L (N_L <= N)?
> If so, what do I need to do to be able to call the kernel with, for
> example, N_L = N/2 or N/4 or whatever? What about the 'result' array?
> Will I always get an O(N) 'result' array, or will it be O(N_L)?
>
>
> thanks in advance!
> and apologies for my English...
> Guilherme
>
>
>
Dear all,
I am working on a O(N^2) N-body code and I decided to write everything in
python/numpy and to use pyopencl only to accelerate the gravitational force
calculation.
In a naive approach for time integration of equations of motion one could
simply use a constant time-step for all particles. Therefore the CL kernel
for force calculation is always called with the same (total) number of
particles, namely N. This works fine since the calculation cost is [for
reasonable values of N (>10^3)] much larger than the communication cost.
However, if one wants to use a smarter approach to the time integration of
the system one needs to use a sort of individual time-step scheme (in my
case, a power-of-2 block time-step scheme). In this case I need to call the
CL kernel with a different number of particles N_L (N_L <= N) whenever a
subset of particles in a given time-step level L needs to be integrated.
So, after making sure that everything works properly, I am now trying to fix
some bottlenecks of my implementation. In the function for force
calculation I do something like this:
...
acc_kernel = kernel_library.get_kernel("p2p_acc_kernel")
acc_kernel.set_kernel_args(*data, global_size=global_size,
                           local_size=local_size,
                           result_shape=result_shape,
                           local_memory_shape=local_memory_shape)
acc_kernel.run()
result = acc_kernel.get_result()
...
where 'data' is a tuple of numpy arrays.
After some profiling I have figured out that the main bottleneck is in the
'set_kernel_args' method, where I have two calls to
cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=hostbuf)
and one call to
cl.Buffer(ctx, mf.WRITE_ONLY, size=kernel_result.nbytes)
I know that calling cl.Buffer every time I need to call the CL kernel is
quite inefficient, but that was the only way I found to get the right
result.
So, my questions are:
i) How can I avoid this problem without completely refactoring my
implementation (I don't want to lose too much of the flexibility of my
code)?
ii) Is it possible to allocate the maximum size [which is O(N)] of the
device memory at the beginning of the simulation and still be able to call
the CL kernel with different numbers of particles N_L (N_L <= N)? If so,
what do I need to do to be able to call the kernel with, for example,
N_L = N/2 or N/4 or whatever? What about the 'result' array? Will I always
get an O(N) 'result' array, or will it be O(N_L)?
thanks in advance!
and apologies for my English...
Guilherme
On Mon, 14 May 2012 21:51:59 +0200, Marko Bencun <mbencun(a)gmail.com> wrote:
> Hi Andreas
>
> That is unfortunate, I can't debug it as I don't have an ATI card. I
> committed a possible fix, but it is nothing more than a guess (here is the
> diff:
> https://github.com/benma/pysph/commit/7f4a51d66702a5442e82d413480ac8db294d4…
> ).
>
> If it is not too much trouble, please get the newest version (pull or "git
> clone git@github.com:benma/pysph.git") and test again. If it still does not
> work, I would need someone with an ATI card to pitch in and help debug this
> problem.
Sad to say this hasn't helped.
Thanks for trying,
Andreas
Massimo,
Have you heard of Random123 [1]? I'm currently using a personal
implementation of George Marsaglia's generators (multiply-with-carry,
3-shift-register, and congruential) within my kernel. But I am strongly
tempted to switch to Random123, also within the kernel, to cut down on
streams. My existing implementation uses four state variables; if I
switch, that drops to one, which in my case I would borrow from the
time-step variable. Kernel instances are then differentiated by using
get_global_id(0). The literature looks promising for it as a random
number generator.
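To make that concrete, the 3-shift-register (xorshift) part per work item
looks roughly like this (untested sketch; the naive gid-based seeding, and
the fact that no state is carried between calls, are exactly what
Random123 or a stored state table would do better):

import numpy as np
import pyopencl as cl

SRC = """
/* One xor128 state (4 uints) per work item, seeded from the global id.
   Real code would load/store the state from a global table between
   calls; this crude seeding gives no independence guarantee. */
__kernel void fill_random(__global float *out, uint n)
{
    uint gid = get_global_id(0);
    if (gid >= n)
        return;
    uint x = 123456789u ^ gid, y = 362436069u,
         z = 521288629u, w = 88675123u + gid;
    uint t = x ^ (x << 11);
    x = y; y = z; z = w;
    w = w ^ (w >> 19) ^ (t ^ (t >> 8));
    out[gid] = w * (1.0f / 4294967296.0f);  /* map to [0, 1) */
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
n = 1 << 16
out_buf = cl.Buffer(ctx, cl.mem_flags.WRITE_ONLY, size=4 * n)
prg = cl.Program(ctx, SRC).build()
prg.fill_random(queue, (n,), None, out_buf, np.uint32(n))
out = np.empty(n, dtype=np.float32)
cl.enqueue_copy(queue, out, out_buf)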
Dave.
[1] http://www.deshawresearch.com/resources_random123.html
> From: Massimo Di Pierro<massimo.dipierro(a)gmail.com>
> To: Alex Toldayev<toldayev.alexey(a)gmail.com>
> Cc: pyopencl(a)tiker.net
> Sent: Monday, May 14, 2012 11:12 AM
> Subject: Re: [PyOpenCL] thanks for opencl
>
>
> Hello Alex,
>
> I did not know about it. I will look into it. The code you reference does not say how the generator is initialized. For me it is critical to be able to initialize many generators and make sure the sequences do not overlap. The smaller the state, the better. Speed is not critical.
>
> massimo
>
>
> On May 14, 2012, at 1:07 PM, Alex Toldayev wrote:
>
> Hi Massimo!
>>
>>
>> Have you considered the xor128/xorshift PRNG? They say it has a long period and passes the diehard tests. I'm curious because I also did lattice QCD calculations, and used xor128 for them. And the source code is quite simple (http://en.wikipedia.org/wiki/Xorshift#Example_Implementation), so I'd like to find out how good it is ))
>>
>>
>> Also, xorshift uses only bitwise operations, so I guess it should be equally fast with integers and floats?
Dear Massimo,
On Mon, 14 May 2012 14:48:54 -0500, Massimo DiPierro <massimo.dipierro(a)gmail.com> wrote:
>
>
That email arrived empty. Can you please resend? Also, can you please
stick to the correct threads when replying to email? Those of us with
threaded mail readers (me! :) will very much appreciate it!
Thanks,
Andreas
Ok, thanks Ian and Marko, that's good to know.
cheers,
sven
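P.S. For the archives, Marko's padding trick in plain numpy terms, as I
understand it (untested sketch):

import numpy as np

def pad_to_pow2(keys):
    """Pad uint32 keys with the largest uint32 up to the next
    power-of-two length; the sentinels sort to the end."""
    n = len(keys)
    n_pad = 1 << max(n - 1, 0).bit_length()  # next power of two >= n
    padded = np.empty(n_pad, dtype=np.uint32)
    padded.fill(np.iinfo(np.uint32).max)
    padded[:n] = keys
    return padded

keys = np.array([7, 3, 5, 1, 9], dtype=np.uint32)
padded = pad_to_pow2(keys)   # length 8; the last 3 entries are sentinels
# Sort 'padded' on the device, then keep only the first len(keys) items.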
On 14.05.2012 at 21:33, Marko Bencun wrote:
> Hi Sven
>
> I don't think it is trivial to modify the sorting algorithm to handle
> different array lengths other than just inserting dummy values, as Ian
> said. In pysph, I use radix sort, and I use the largest uint32 for
> padding. See here:
> https://github.com/benma/pysph/blob/7c9bb8da0e366ebc261f54a10686bbb91487cbc…
>
> Best, Marko
>
> Date: Mon, 14 May 2012 10:44:13 +0200
> From: Sven Schreiber <svetosch(a)gmx.net>
> To: pyopencl(a)tiker.net
> Subject: Re: [PyOpenCL] pysph: a particle fluid simulation using
> pyopencl
> Message-ID: <4FB0C5DD.6000900(a)gmx.net>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi Marko,
>
> I was looking at the sorting code port/wrappers that you are using there
> to see if I could "steal" something, and I stumbled over the comment
> "only power-of-two array lengths are supported" in bitonic_sort.py.
> Since I know very little about sorting algorithms on GPUs I'm not sure
> whether that's an inherent limitation or if it could be easily
> generalized. At the very least it seems that it should be possible to
> add dummy elements to the array to get to a power-of-two length. Do you
> have any hints on how it could be made more general? Also, does the
> limitation also apply to the radix sort?
>
> Thanks for any comments,
> Sven
>
> On 05/04/2012 01:36 PM, Marko Bencun wrote:
>> Hello everyone
>>
>> I created an SPH fluid simulation with Python and pyopencl. Find it here:
>> https://github.com/benma/pysph.
>>
>> I would appreciate any feedback. Also, since I have an nvidia card, I
>> would be glad if someone using an ATI card could test it and tell me
>> whether it works.
>>
>> Best, Marko
>>
>>
On Sun, 13 May 2012 22:49:30 -0500, Massimo DiPierro <massimo.dipierro(a)gmail.com> wrote:
> Hello Andreas,
>
> I would like to propose the following patch to Ivar as I am not sure of the right patch submission process.
>
> The patch adds a new function:
>
> ranluxcl_init(ulong ins, global ranluxcl_state_t *ranluxcltab)
>
> ranluxcl_init allows one to pass an explicit value for what ranluxcl_initialization defines as:
>
> ulong x = (ulong)RANLUXCL_MYID + (ulong)ins * ((ulong)UINT_MAX + 1);
>
> The first argument of ranluxcl_init is the value of x.
>
> This allows the creation of multiple random number generators, each with a state uniquely associated with this initialization parameter, independently of the value of RANLUXCL_MYID.
>
> I need this because I have a large lattice and I need one PRNG per lattice site. The state of the PRNGs must not depend on the device or on the parallelization of the problem. The algorithm is designed in such a way that each site only uses its own local PRNG. The PRNGs must be independent. I work on lattice QCD, which is the same field Martin Lüscher works in, and I believe this is how this generator was originally intended to be used.
>
> I tried to write the patch with minimal changes and avoiding duplication of code. I have no attachment to the new function name.
>
>
> For the time being… how do I tell pyopencl to look for included files
> in the local working directory?
prg.build(options=["-I", "."])
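i.e., with the usual setup (file name made up):

import pyopencl as cl

ctx = cl.create_some_context()
src = open("mykernel.cl").read()   # which may  #include "ranluxcl.cl"
prg = cl.Program(ctx, src).build(options=["-I", "."])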
Please share the bitbucket issue id for the patch so that I can listen
in.
Thanks!
Andreas
Hi Andreas
That is unfortunate, I can't debug it as I don't have an ATI card. I
committed a possible fix, but it is nothing more than a guess (here is the
diff:
https://github.com/benma/pysph/commit/7f4a51d66702a5442e82d413480ac8db294d4…
).
If it is not too much trouble, please get the newest version (pull or "git
clone git@github.com:benma/pysph.git") and test again. If it still does not
work, I would need someone with an ATI card to pitch in and help debug this
problem.
Best, Marko
---------------
Date: Sat, 12 May 2012 15:38:30 -0400
From: Andreas Kloeckner <lists(a)informa.tiker.net>
To: <pyopencl(a)tiker.net>
Subject: Re: [PyOpenCL] pysph: a particle fluid simulation using
pyopencl
Message-ID: <58aaae4f94e79f79ace3f9bcf013cac4(a)tiker.net>
Content-Type: text/plain; charset=UTF-8; format=flowed
On 04.05.2012 at 07:36, Marko Bencun wrote:
> Hello everyone
>
> I created an SPH fluid simulation with Python and pyopencl. Find it
> here: https://github.com/benma/pysph [1].
>
> I would appreciate any feedback. Also, since I have an nvidia card, I
> would be glad if someone using an ATI card could test it and tell me
> whether it works.
>
> Best, Marko
Finally got a chance to try this on my AMD Llano APU; no luck:
7290 particles
initial density: 1000, mass: 17.146776406, gas constant k: 1000,
timestep: 0.00407722018784
(20, 20, 20) 8000 cells
/home/andreas/src/pyopencl/pyopencl/__init__.py:36: CompilerWarning:
Non-empty compiler output encountered. Set the environment variable
PYOPENCL_COMPILER_OUTPUT=1 to see more.
"to see more.", CompilerWarning)
/home/andreas/pack/pysph/src/sph/sph.py:198: DeprecationWarning: struct
integer overflow masking is deprecated
np.float32(self.mass))
Traceback (most recent call last):
File "/home/andreas/pack/pysph/src/fluid_widget.py", line 44, in
initializeGL
self.sph_demo.glinit()
File "/home/andreas/pack/pysph/src/sph_demo.py", line 29, in glinit
self.fluid_simulator.cl_init()
File "/home/andreas/pack/pysph/src/sph/sph.py", line 177, in cl_init
self.cl_init_data()
File "/home/andreas/pack/pysph/src/sph/sph.py", line 205, in
cl_init_data
self.position_cl = cl.GLBuffer(ctx, mf.READ_WRITE,
int(self.position_vbo.buffers[0]))
LogicError: clCreateFromGLBuffer failed: invalid gl object
Any ideas?
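One guess, not verified on this setup: clCreateFromGLBuffer can fail with
"invalid gl object" when the CL context was not created with GL-sharing
properties while the GL context was current. In pyopencl that setup looks
roughly like this:

import pyopencl as cl
from pyopencl.tools import get_gl_sharing_context_properties

# The GL context (here, the QGLWidget's) must be current on this thread
# before the CL context is created.
platform = cl.get_platforms()[0]
ctx = cl.Context(
    properties=[(cl.context_properties.PLATFORM, platform)]
    + get_gl_sharing_context_properties())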