On Mon, 30 Mar 2015 12:46:42 -0400
Ananth Sridharan <ananth(a)umd.edu> wrote:
> I have a simulation code which requires the use of multiple kernels. Each
> of these kernels (global functions) needs to call a common set of device
> functions. To organize code better, I'd like to provide multiple source
> modules - one (or more) for the kernels, and one for the common
> dependencies.
Sometimes I concatenate sources (as Python strings) before compilation ... in OpenCL.
This leaves me with less source code to maintain and a better factorisation.
This trick applies to your case as well.
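A minimal sketch of that trick carried over to PyCUDA, assuming the concatenated string is handed to `pycuda.compiler.SourceModule`. The helper `scale`, the kernel `scale_all`, and the functions `build_source` and `compile_module` are made-up names for illustration, not anything from the original code:

```python
# Shared __device__ helpers, kept in one Python string (hypothetical example).
COMMON_SRC = r"""
__device__ float scale(float x, float a)
{
    return a * x;
}
"""

# Kernels that call the shared helper; this could live in another string or file.
KERNEL_SRC = r"""
__global__ void scale_all(float *data, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale(data[i], a);
}
"""


def build_source(*parts):
    """Join source fragments in order; helpers must come before their callers."""
    return "\n".join(parts)


FULL_SRC = build_source(COMMON_SRC, KERNEL_SRC)


def compile_module(source=FULL_SRC):
    """Compile the concatenated source (needs PyCUDA and a CUDA device)."""
    import pycuda.autoinit  # noqa: F401  (creates a context on import)
    from pycuda.compiler import SourceModule
    return SourceModule(source)
```

Because everything ends up in one translation unit, the kernels see the device functions without any extra syntax. If the common code lives in a header file instead, `SourceModule` also takes an `include_dirs` argument, so an `#include` at the top of the kernel string may be an alternative.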
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF

Hi,
I have a simulation code which requires the use of multiple kernels. Each
of these kernels (global functions) needs to call a common set of device
functions. To organize code better, I'd like to provide multiple source
modules - one (or more) for the kernels, and one for the common
dependencies.
I'm missing the syntax (if it exists) to let the source module containing
the kernels "know" the functions in the source module containing the device
functions. Can someone help me out?
(I'm a pyCuda novice, and have basic working knowledge of
cuda-c/cuda-fortran)
Ananth

One other tip on setting up your program is to remember to reduce memory
accesses as much as possible, so try to maximize the computations you
perform for every memory transfer. So you'll probably want to load a large
chunk of tf and compute on several indices of Mf and farray.
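One way this could look, as a rough sketch rather than a tuned implementation: each thread loads a single element of tf once and reuses it for every frequency term, so there is one load and one store of the large array per output element. Single precision, the kernel name `msin`, and the helpers `msin_reference` and `msin_gpu` are assumptions made for the example; the original code most likely runs in double precision, and float32 may not be accurate enough for large phase arguments.

```python
import math

# CUDA source: one thread per element of tf. Each thread reads its time value
# once into a register and reuses it for every frequency term.
MSIN_SRC = r"""
__global__ void msin(const float *farray, const float *Mf,
                     const float *tf, float *out, int nfreq, int ntime)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ntime) return;

    const float t = tf[i];      /* single global load of tf */
    float acc = 0.0f;
    for (int k = 0; k < nfreq; ++k)
        acc += Mf[k] * 2.0f * cosf(2.0f * 3.14159265f * farray[k] * t);
    out[i] = acc;               /* single global store */
}
"""


def msin_reference(farray, Mf, tf_flat):
    """Pure-Python mirror of the kernel's per-thread loop, for checking results."""
    return [
        sum(m * 2.0 * math.cos(2.0 * math.pi * f * t) for f, m in zip(farray, Mf))
        for t in tf_flat
    ]


def msin_gpu(farray, Mf, tf, block=256):
    """Launch the kernel; inputs are contiguous float32 NumPy arrays (needs PyCUDA)."""
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates a context on import)
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    kernel = SourceModule(MSIN_SRC).get_function("msin")
    out = np.empty(tf.shape, dtype=np.float32)
    grid = ((tf.size + block - 1) // block, 1)
    kernel(drv.In(farray), drv.In(Mf), drv.In(tf), drv.Out(out),
           np.int32(farray.size), np.int32(tf.size),
           block=(block, 1, 1), grid=grid)
    return out
```

In this sketch every thread still reads farray[k] and Mf[k] from global memory on each iteration. Staging chunks of those two arrays in shared memory would be a possible refinement, which is not shown here.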
Craig
On Sat, Mar 28, 2015 at 8:19 PM Craig Stringham <stringham(a)mers.byu.edu>
wrote:

Hi Bruce,
That's an excellent problem for a GPU. However, because each problem uses a
fair amount of memory, how carefully you access that memory will dominate
your performance gains (as is typical when using a GPU). For example, tf
won't fit in the shared memory or cache of a multiprocessor, so you'll also
want to divide the problem again.
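A quick back-of-the-envelope check of the sizes involved, as a sketch. The array shapes come from the original post; the float32 element size and the 48 KiB of shared memory per block (the usual maximum on compute capability 3.0) are assumptions added here:

```python
# Sizes quoted in the thread.
TF_SHAPE = (2048, 256)        # approximate shape of tf
N_COEFF = 10000               # length of farray and of Mf

# Assumptions (not stated in the thread): float32 data and 48 KiB of shared
# memory per block, the usual maximum on compute capability 3.0.
BYTES_PER_VALUE = 4
SHARED_BYTES = 48 * 1024

tf_bytes = TF_SHAPE[0] * TF_SHAPE[1] * BYTES_PER_VALUE
coeff_bytes = 2 * N_COEFF * BYTES_PER_VALUE   # farray and Mf together

# Number of pieces tf must be cut into for one piece to fit (ceiling division).
tf_pieces = -(-tf_bytes // SHARED_BYTES)

print("tf: %d KiB -> at least %d pieces" % (tf_bytes // 1024, tf_pieces))
print("farray + Mf: %d bytes" % coeff_bytes)
```

Under these assumptions tf takes 2 MiB, so it would have to be split into at least 43 pieces, and farray plus Mf together (80000 bytes) would not fit in shared memory in one go either. With float64 data all of these figures double.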
If you don't need to get this working for routine usage though, you might
just try using numba primitives to move it to a GPU. I haven't used them,
so I can't attest that it will give you a good answer. On the other hand,
this is the sort of problem that makes learning CUDA and PyCUDA easy, so
you might as well give it a shot.
Regards,
Craig
On Sat, Mar 28, 2015 at 8:29 AM Bruce Labitt <bdlabitt(a)gmail.com> wrote:

From reading the documentation, I am unsure whether parallelizing this kind
of function is worth doing in pycuda.
I'm trying to add the effect of phase noise into a radar simulation. The
simulation is written in Scipy/numpy. Currently I am using joblib to run
on multiple cores, but it is too slow for the scenarios I wish to try. It
does work for a small number of targets and reduced phase noise array sizes.
The following is the current approach:
Function to parallelize:
    from numpy import cos, pi
    def MSIN( farray, Mf, tf, jj ):
        """
        farray, Mf, tf, jj
        farray  array of frequencies (size = 10000)
        Mf      array of coefficients (size = 10000)
        tf      2D array ~[2048 x 256] of time
        jj      list of indices (fraction of the problem to solve)
        """
        Msin = 0.0
        for ii in jj:
            # add one weighted cosine term, evaluated over the whole tf array
            Msin = Msin + Mf[ii] * 2.0*cos( 2.0*pi*farray[ii]*tf )
        return Msin
Current method to call function in parallel (multiprocessing):
    from functools import reduce
    from operator import add
    from joblib import Parallel, delayed
    """
    ====================================================
    Parallel computes the function MSIN with njobs cores
    ====================================================
    """
    MMM = Parallel(n_jobs=njobs, max_nbytes=None)\
        (delayed(MSIN)( f, aa, tf1, ii ) for ii in idx)
    Msin = reduce(add, MMM)  # add all the results of the cores together
Any suggestions for porting this to pycuda? Is it a reasonable candidate?
In essence, it is accumulating a scalar-weighted cos function for many
elements of a 2D array. It 'feels' like it should be portable. Any road
blocks foreseen? The 2D array of times is contiguous in the sense of
stride, but there are discontinuous jumps in the time values in the array,
which I do not think is a problem.
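Written out, the accumulation in MSIN amounts to the following for each element of tf. The symbols are shorthand introduced here, not names from the code: $t_{mn}$ is an element of tf, $f_i$ is farray[ii], $M_i$ is Mf[ii], and $J$ is the index list jj.

```latex
M_{\sin}(t_{mn}) \;=\; \sum_{i \in J} 2\, M_i \cos\!\left( 2\pi f_i\, t_{mn} \right),
\qquad m = 1,\dots,2048, \quad n = 1,\dots,256 .
```

Each output element depends only on its own $t_{mn}$, so jumps in the time values between neighbouring elements do not couple the computations.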
I have from DumpProperties.py:
  Device #0: GeForce GTX 680M
  Compute Capability: 3.0
  Total Memory: 4193984 KB
  CAN_MAP_HOST_MEMORY: 1
  CLOCK_RATE: 758000
  MAX_BLOCK_DIM_X: 1024
  MAX_BLOCK_DIM_Y: 1024
  MAX_BLOCK_DIM_Z: 64
  MAX_GRID_DIM_X: 2147483647
  MAX_GRID_DIM_Y: 65535
  MAX_GRID_DIM_Z: 65535
CUDA 6.5
Thanks in advance for any insight, or suggestions on how to attack the
problem
-Bruce

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Dear PyCUDA community,
for lack of a better place, I'll submit a tentative "bug"
report here. I have posted my situation on stackoverflow:
https://stackoverflow.com/questions/28829536/pycuda-test-cumath-py-fails-on…
- ---
Essentially, I've installed pycuda on a machine featuring a Tesla
C2075. I'm running Ubuntu 14.04 with the CUDA 6.0 compiler installed.
Using python 2.7.9 (via the anaconda distribution) and numpy 1.9.0, I
have installed pycuda 2014.1 from the ZIP file that Andreas Kloeckner
provides on his website (http://mathema.tician.de/software/pycuda/).
Running the tests provided by that ZIP file goes well except for
the test_cumath.py file. I receive the following error:
E AssertionError: (2.3841858e-06, 'cosh', <type 'numpy.complex64'>)
E assert <built-in method all of numpy.bool_ object at 0x7f00747f3880>()
E + where <built-in method all of numpy.bool_ object at 0x7f00747f3880> = 2.3841858e-06 <= 2e-06.all
test_cumath.py:54: AssertionError
===== 1 failed, 27 passed in 12.57 seconds =====
- ---
I can consistently reproduce the same number (and error) on the four
different C2075 that I have available. Is this simply a tolerance that
has not been set appropriately?
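For what it is worth, a small check of how the two numbers in the assertion relate to single-precision machine epsilon. This is only arithmetic on the reported values, assuming complex64 is built from two IEEE 754 binary32 floats; it does not establish the cause of the failure:

```python
# Values copied from the assertion message.
observed = 2.3841858e-06   # reported deviation for cosh on complex64
threshold = 2e-06          # limit the test compares against

# Machine epsilon of IEEE 754 single precision (binary32).
eps32 = 2.0 ** -23

ratio = observed / eps32
print("observed / eps32  = %.6f" % ratio)
print("threshold / eps32 = %.6f" % (threshold / eps32))
```

The observed deviation comes out at almost exactly 20 epsilons, while the threshold sits at roughly 16.8 epsilons, which would be consistent with a rounding-level difference narrowly missing a tight tolerance.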
I guess I am not the first one to run this test, so I'm wondering
whether I am doing something wrong... ;-)
Please point me to a better place to ask this question if this
is the wrong address.
Thanks for your help,
Adrian
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAEBAgAGBQJU9dVtAAoJELJrqqxor1RqWaIH/RFC9oAF3me+2HfQlX3H5uM0
jEDheeJ7yoR13uSkm5eVNd/dZr/vqzsCgHshYkslZy09eL2qsvytvXb279kB+9FL
oPUDxywICBad1h0RjHdC2vgrNJmyQcjLUcl7fnvjp+obtKR67lRsQDkcQsRLSsRe
649TS5kkdb2ukDPv8XoikVi5Lr/mZ7J6HHOkr9+e+lD8megOhOwYxQQVSRtt2XJQ
TR7PJh0ycn/nKe1ksSOl9KXTZuTbQQ4w7g4n0SiiYra0nZTXd+oEFOxqMDiuJ89h
YFcN1wv6TxP8ZaWKhlffOeGrZHQh5PmusDW0EQBTME10HqNKg9gKXxJOmRhS95s=
=rwVG
-----END PGP SIGNATURE-----