Re: [PyCUDA] Thread Problem
by Andrea Cesari

Ok, understood.
Thanks!
> Date: Tue, 10 Jul 2012 23:27:27 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> On Tue, Jul 10, 2012 at 11:22 PM, Andrea Cesari
> <andrea_cesari(a)hotmail.it> wrote:
> > If I understood correctly, dt.dtype_to_ctype(type) tells me the
> > corresponding variable type in Python?
>
> Not Python per se, but numpy types (the ones you get from numpy
> arrays' dtype field). dtype_to_ctype() takes a numpy datatype object
> and returns the C equivalent. NAME_TO_DTYPE provides the
> correspondence in the other direction, i.e. from C types to numpy
> datatypes.
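[Editor's note: the numpy side of this correspondence can be inspected without a GPU. A minimal sketch using only numpy (the array name is mine, not from the thread):]

```python
import numpy

# The dtype objects that dtype_to_ctype() expects come from numpy
# arrays' dtype field (or from numpy.dtype directly).
a = numpy.zeros(4, dtype=numpy.int32)
print(a.dtype)           # int32
print(a.dtype.itemsize)  # 4 bytes, matching C 'int' on common platforms
print(numpy.dtype(numpy.int16).itemsize)  # 2 bytes, matching C 'short'
```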

Re: [PyCUDA] Thread Problem
by Andrea Cesari

Oh, great!
Now everything works fine!
Thanks for the advice.
If I understood correctly, dt.dtype_to_ctype(type) tells me the corresponding variable type in Python?
> Date: Tue, 10 Jul 2012 23:07:07 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: mantihor(a)gmail.com
> To: andrea_cesari(a)hotmail.it
> CC: pycuda(a)tiker.net
>
> Hi Andrea,
>
> On Tue, Jul 10, 2012 at 10:41 PM, Andrea Cesari
> <andrea_cesari(a)hotmail.it> wrote:
> > dest=numpy.zeros(lung_vett,dtype=numpy.int16);
>
> Should be int32. When in doubt, use
> pycuda.compyte.dtypes.dtype_to_ctype() and
> pycuda.compyte.dtypes.NAME_TO_DTYPE (there's no ctype_to_dtype()
> function at the moment):
>
> >>> import numpy
> >>> import pycuda.autoinit
> >>> import pycuda.compyte.dtypes as dt
> >>> dt.dtype_to_ctype(numpy.float32)
> 'float'
> >>> dt.dtype_to_ctype(numpy.int32)
> 'int'
> >>> dt.dtype_to_ctype(numpy.int16)
> 'short'
> >>> dt.NAME_TO_DTYPE['int']
> dtype('int32')

Thread Problem
by Andrea Cesari

Hi,
I'm new to PyCUDA.
I need to implement a cross-correlation in CUDA.
While doing this I ran into some problems, so I'm working through some basic samples.
For example, if I do this:
import pycuda.autoinit
import pycuda.driver as drv
import numpy
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void thread_index(float *dest)
{
    int i = threadIdx.x;
    dest[i] = i;
}
""")
lung_vett = 10
thread_index = mod.get_function("thread_index")
dest = numpy.zeros(lung_vett)
thread_index(drv.Out(dest), block=(lung_vett, 1, 1))
print dest
I expect dest to be like this: dest = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
But PyCUDA returns random numbers like these: [ 7.81250000e-003 3.20000076e+001 2.04800049e+003 3.27680079e+004
2.62144063e+005 1.76718726e+300 2.39291672e+300 1.11420383e+282
2.23435632e+297 7.47372270e+294].
I wrote the same script in C and the result is correct, that is to say: dest = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
Where is my error?
Excuse my English; I'm an Italian student.
Thanks to all.
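[Editor's note: the likely culprit is the dtype. numpy.zeros defaults to float64, while the kernel's float *dest writes float32, so the bytes copied back are reinterpreted. A CPU-only sketch of the symptom and the fix (variable names beyond lung_vett/dest are mine):]

```python
import numpy

lung_vett = 10

# What the kernel effectively writes: float32 values 0..9.
kernel_output = numpy.arange(lung_vett, dtype=numpy.float32)

# numpy.zeros defaults to float64, so the float32 bytes the kernel
# wrote are read back as float64 -- reinterpreted bytes look random.
dest = numpy.zeros(lung_vett)                     # dtype=float64 (the bug)
dest.view(numpy.float32)[:lung_vett] = kernel_output
print(dest)  # first values match the garbage in the post (7.8125e-3, 32.0000076, ...)

# The fix: match the kernel's float* parameter with float32.
dest_ok = numpy.zeros(lung_vett, dtype=numpy.float32)
dest_ok[:] = kernel_output
print(dest_ok)  # 0..9 as expected
```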

Re: [PyCUDA] Do tasks run in background?
by Orestis K

Thank you, Marmaduke! For points 1 and 2, that's what I've noticed too.
-Orestis
Date: Fri, 6 Jul 2012 13:42:32 +0200
Subject: Re: [PyCUDA] Do tasks run in background?
From: duke(a)eml.cc
To: orekost(a)hotmail.com
CC: pycuda(a)tiker.net
Hi
You may want to hold out for a more authoritative response from someone else, but I have noticed, and write my code assuming, that
- func() will launch the kernel and return (almost) immediately
- attempts to access gpuarrays involved in a launched kernel will block until the launched kernel has completed
- pycuda.driver.Context.synchronize can be called to explicitly wait for a kernel launch to complete (which is useful if you have two kernels operating on the same data, as they could otherwise run simultaneously)
cheers
Marmaduke
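[Editor's note: the behaviour described in the points above can be sketched with a CPU-only analogy, using a Python thread to stand in for the GPU; this is an illustration, not the PyCUDA API:]

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel():
    # Stands in for the work the GPU does after a kernel launch.
    time.sleep(0.2)
    return 42

executor = ThreadPoolExecutor(max_workers=1)

t0 = time.time()
future = executor.submit(fake_kernel)   # "launch": returns (almost) immediately
launch_delay = time.time() - t0

result = future.result()                # "access the result": blocks until done
total_delay = time.time() - t0

print(launch_delay < 0.1)   # True: the launch itself did not wait
print(total_delay >= 0.2)   # True: reading the result waited for completion
```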
On Fri, Jul 6, 2012 at 11:39 AM, Orestis K <orekost(a)hotmail.com> wrote:
Hello everyone!
I'm new to PyCUDA and GPU programming, but my initial experiences have been very pleasant. I started out with some simple tasks, and they seem blazingly fast compared to running on a CPU. However, I would like to confirm that it's indeed as fast as it seems.
My main question is whether, after 'func' is called and the prompt returns, any tasks are still running on the GPU. If so, is there a way to block the next tasks from starting until the current one has finished?
I've posted the code below for reference. You can decrease the value of N so that it runs faster; I set it very close to the limit so that I might notice a delay before the command prompt returns.
Thank you in advance and please keep up the excellent work!
-Orestis
=================================================================
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import sys, numpy, random, string
# create random input data
N = 33500000
buf = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for x in xrange(N))
mod = SourceModule("""
__global__ void get_words(int N, char *a,unsigned int *b)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if ( idx <N-3)
{
b[idx] = (a[idx] << 24) + (a[idx+3]);
}
}
""")
func = mod.get_function("get_words")
# copy buffer to GPU
bufArray = cuda.mem_alloc(N)
cuda.memcpy_htod(bufArray, buf)
# create results array on GPU
resArray = gpuarray.to_gpu(numpy.zeros((N-3,1),dtype=numpy.int32))
# setup parameters and execute function
threadsPerBlock = 512
blocksPerGrid = (N+threadsPerBlock-1)/threadsPerBlock
func(numpy.int32(len(buf)), bufArray, resArray, grid=(blocksPerGrid ,1), block=(threadsPerBlock,1,1))
# get back results
a = numpy.zeros((N-3,1),dtype=numpy.int32)
b = resArray.get(a)
a = a.reshape(-1).tolist()
_______________________________________________
PyCUDA mailing list
PyCUDA(a)tiker.net
http://lists.tiker.net/listinfo/pycuda

Do tasks run in background?
by Orestis K

Hello everyone!
I'm new to PyCUDA and GPU programming, but my initial experiences have been very pleasant. I started out with some simple tasks, and they seem blazingly fast compared to running on a CPU. However, I would like to confirm that it's indeed as fast as it seems.
My main question is whether, after 'func' is called and the prompt returns, any tasks are still running on the GPU. If so, is there a way to block the next tasks from starting until the current one has finished?
I've posted the code below for reference. You can decrease the value of N so that it runs faster; I set it very close to the limit so that I might notice a delay before the command prompt returns.
Thank you in advance and please keep up the excellent work!
-Orestis
=================================================================
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gpuarray
import sys, numpy, random, string
# create random input data
N = 33500000
buf = ''.join(random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits) for x in xrange(N))
mod = SourceModule("""
__global__ void get_words(int N, char *a,unsigned int *b)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if ( idx <N-3)
{
b[idx] = (a[idx] << 24) + (a[idx+3]);
}
}
""")
func = mod.get_function("get_words")
# copy buffer to GPU
bufArray = cuda.mem_alloc(N)
cuda.memcpy_htod(bufArray, buf)
# create results array on GPU
resArray = gpuarray.to_gpu(numpy.zeros((N-3,1),dtype=numpy.int32))
# setup parameters and execute function
threadsPerBlock = 512
blocksPerGrid = (N+threadsPerBlock-1)/threadsPerBlock
func(numpy.int32(len(buf)), bufArray, resArray, grid=(blocksPerGrid ,1), block=(threadsPerBlock,1,1))
# get back results
a = numpy.zeros((N-3,1),dtype=numpy.int32)
b = resArray.get(a)
a = a.reshape(-1).tolist()
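[Editor's note: the blocksPerGrid line above is the standard ceiling-division pattern for covering N elements with fixed-size blocks. A minimal standalone sketch (note that under Python 3 it needs // rather than /):]

```python
def blocks_per_grid(n, threads_per_block):
    # Ceiling division: enough blocks to cover n elements,
    # with the last block possibly partly idle.
    return (n + threads_per_block - 1) // threads_per_block

print(blocks_per_grid(512, 512))       # 1: exact fit
print(blocks_per_grid(513, 512))       # 2: one extra element needs a block
print(blocks_per_grid(33500000, 512))  # 65430 blocks for the N above
```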

PyCUDA poor FP32 performance on Fermi ?
by Roberto Colistete Jr.

Hi,
This is my first post in this PyCUDA group. I am using PyCUDA vs.
CUDA vs. Mathematica 8 CUDA to compare performance on some problems in
physics.
Up to CC 1.3, the DP/SP (FP64/FP32) performance ratio of PyCUDA
was as expected (near 1/8 or 1/12), comparable to running
CUDA or Mathematica 8 CUDA.
But using the same source code on any GPU device with CC 2.0/2.1
(Fermi), the performance in FP32 (SP) is poor, with:
- a DP/SP ratio of approx. 1/3 to 1/2;
- the better GPU device (Tesla C2050, CC 2.0) being slower (0.77 s vs. 0.33 s) in
FP32 than the older GPU (Tesla C1060, CC 1.3), while in FP64 it is faster
(0.89 s vs. 4.48 s).
The same behaviour happens with other CC 2.x GPU devices (GTX 480,
GT 540M, etc.) and any Linux (Ubuntu, Fedora, etc.).
Do you have an explanation for this issue, and a recommendation
for solving it?
Regards,
Roberto