Dear all,
We are trying to implement a k-nearest-neighbor search on GPUs with
PyOpenCL. The goal of the algorithm: for a given target point, find
the k nearest points from a given set (training data). The distance
between two points is the squared Euclidean distance.
One of our implementations is a brute-force approach, which aims
at processing big data sets in parallel, e.g. 1 million training points and
several million targets (test data). For every target point, one kernel
instance is created that finds the k nearest points among the
training points.
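Roughly, the host side looks like this (a simplified sketch, not our actual
code; the kernel argument list and the file name knn.cl are only illustrative):

import numpy as np
import pyopencl as cl

# sketch: one work item per target point, each scanning all training points
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

train = np.random.rand(1000000, 3).astype(np.float32)   # training points
test = np.random.rand(10000, 3).astype(np.float32)      # target (test) points
k = 10

d_train = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=train)
d_test = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=test)
d_idx = cl.Buffer(ctx, mf.WRITE_ONLY, test.shape[0] * k * 4)  # k nearest indices per target

prg = cl.Program(ctx, open("knn.cl").read()).build()
prg.find_knn(queue, (test.shape[0],), None,
             d_train, d_test, d_idx,
             np.int32(train.shape[0]), np.int32(k))

idx = np.empty((test.shape[0], k), dtype=np.int32)
cl.enqueue_copy(queue, idx, d_idx).wait()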
Our problem is the following: everything works fine for small data sets,
and the results are as expected on both the GPU (GeForce GTX 650,
nVidia driver 313.09) and the CPU (Intel Core i5-3450 with the AMD APP SDK),
running Ubuntu 12.10 and PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> pyopencl-2013.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py",
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a minimal working example, which consists of one
host Python file and one OpenCL kernel file.
We would highly appreciate any help; maybe we made a
mistake that is already known to you.
So the big question for us is: why does it work on the CPU and not on
the GPU?
Are there nVidia-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem[0], 4 bytes cmem[3]
Or are there any rules for using a kernel with big data sets, such as setting
the work group sizes or limiting the maximum memory usage?
The error message "invalid command queue" is confusing, and I wasn't able
to find any helpful information (except that "invalid command queue" often
indicates something like a segfault in the kernel, but I could not find any
wrong array address yet).
Maybe one of you could have a look at our code and spot some stupid
mistake.
We would be very grateful for every hint.
Best regards,
Justin Heinermann,
University Oldenburg
Dear Python/OpenCL community,
I am pretty new to (py)opencl and encountered a problem; maybe it is a lack of understanding of OpenCL on my part, but I found a strange Python segfault:
test program:
#!/usr/bin/python
import numpy, pyopencl
ctx = pyopencl.create_some_context()
data=numpy.random.random((1024,1024)).astype(numpy.float32)
img = pyopencl.image_from_array(ctx, ary=data, mode="r", norm_int=False, num_channels=1)
print img
System: Debian sid, pyopencl 2012.1 (the same code works with Debian stable and v2011.2).
Here is the backtrace obtained with GDB:
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff340c253 in pyopencl::create_image_from_desc(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#2 0x00007ffff342de36 in _object* boost::python::detail::invoke<boost::python::detail::install_holder<pyopencl::image*>, pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>, boost::python::arg_from_python<unsigned long>, boost::python::arg_from_python<_cl_image_format const&>, boost::python::arg_from_python<_cl_image_desc&>, boost::python::arg_from_python<boost::python::api::object> >(boost::python::detail::invoke_tag_<false, false>, boost::python::detail::install_holder<pyopencl::image*> const&, pyopencl::image* (*&)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>&, boost::python::arg_from_python<unsigned long>&, boost::python::arg_from_python<_cl_image_format const&>&, boost::python::arg_from_python<_cl_image_desc&>&, boost::python::arg_from_python<boost::python::api::object>&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#3 0x00007ffff342e06f in boost::python::detail::caller_arity<5u>::impl<pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::detail::constructor_policy<boost::python::default_call_policies>, boost::mpl::vector6<pyopencl::image*, pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object> >::operator()(_object*, _object*) ()
from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#4 0x00007ffff311715b in boost::python::objects::function::call(_object*, _object*) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#5 0x00007ffff3117378 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
#6 0x00007ffff3120593 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#7 0x00007ffff3445983 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<pyopencl::error, void (*)(pyopencl::error const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(pyopencl::error const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#8 0x00007ffff3120373 in boost::python::handle_exception_impl(boost::function0<void>) ()
from /usr/lib/libboost_python-py27.so.1.49.0
#9 0x00007ffff3115635 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
Thanks for your help.
If you are not able to reproduce this bug, I should report it to Debian.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
Hi,
Is there a particular reason why, when I declare a read-only image as
input to my kernel, I have to make the input data buffer read-write?
Attached is a test case, you can find a copy online here:
https://www.xpra.org/trac/browser/xpra/trunk/src/tests/xpra/codecs/test_ope…
And here is the output; it tries various buffer types:
ERR: <type 'str'> buffer fails: Cannot use string as modifiable buffer
OK: <class 'ctypes.c_char_Array_65536'> buffer
OK: <type 'buffer'> buffer
ERR: <type 'buffer'> buffer fails: buffer is read-only
Since the memory is only ever used as input, I would much prefer keeping
it in a read-only buffer.
This is a bit inconvenient for us, as PyOpenCL is just one of many data
consumers we have, and I've had to make all the memory buffers
read-write, which potentially exposes us to more bugs.
Is there a known workaround, or anything we can do to not require
read-write memory?
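A copy-based fallback would presumably work, roughly like the sketch below
(the 256x256 float32 source is just a placeholder for our real read-only
input), but the extra copy is exactly what we are trying to avoid:

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()

# placeholder for the read-only input we receive (e.g. a str / read-only buffer)
readonly_bytes = np.random.random((256, 256)).astype(np.float32).tostring()

# bytearray() makes a writable copy, so PyOpenCL no longer sees a read-only
# buffer, at the cost of one extra copy per call
writable = np.frombuffer(bytearray(readonly_bytes), dtype=np.float32).reshape(256, 256)
img = cl.image_from_array(ctx, writable, mode="r", norm_int=False, num_channels=1)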
Cheers
Antoine
________________________________________
From: Joe Haywood
Sent: Monday, December 23, 2013 10:18 AM
To: pyopencl(a)tiker.net
Subject: RE: openCL your thoughts
I had a chance to do some profiling with the AMD CodeXL and NVidia's visual profiler. Unfortunately the NVidia profiler doesn't work with my pyopencl program, only with the original CUDA program I created. However, both profilers say kernel occupancy is low: the AMD profiler reports 12.5% kernel occupancy for the pyopencl code, and the NVidia CUDA profiler reports ~20% kernel occupancy. Here is the code to launch the kernel in pyopencl; any ideas how to make the kernel occupancy go up, or is it an effort in futility since the kernel is so demanding per thread?
PyOpenCL kernel launching: nMP and nthreads are anything from (96,682) to (128,1024). The values of (128,1024) make the code run the fastest, 40 seconds, on my AMD hardware; the values of (96,682) make it run fastest (16 seconds) on my NVidia hardware. I can get the times down to 35 and 15 seconds, respectively, by using RANLUXCL_LUX=2 in the compile. But even then the NVidia OpenCL version is 5 seconds slower than my original CUDA implementation, which only takes 10 seconds. Note: the for loop as written gives the fastest code; moving anything around, e.g. taking out the enqueue copies or only initializing the random number generator once, makes the code take longer to run!?!?
****
platformc=cl.get_platforms()[0]
devicec=platformc.get_devices(cl.device_type.GPU)[0]
ctxc=cl.Context([devicec])
prgc=cl.Program(ctxc,calccode).build(options="-cl-mad-enable -cl-fast-relaxed-math")
cleMC=prgc.cleMC
cleMC.set_scalar_arg_dtypes([None, None, None,None,None,None,np.int32,np.float32,\
np.float32,np.float32,np.float32,np.int32,None,None,np.int32,np.int32,np.int32,\
np.float32,np.float32,np.float32,np.float32,np.float32,np.float32,np.float32,np.float32,np.float32,None])
RanInit=prgc.Kernel_Ranluxcl_Init
RanInit.set_scalar_arg_dtypes([np.uint32,None])
queuec=cl.CommandQueue(ctxc,properties=cl.command_queue_properties.PROFILING_ENABLE)
pdose=np.zeros_like(dose3)
dens3_buf=cl.Buffer(ctxc,mf.READ_ONLY ,dens3.nbytes)
cl.enqueue_write_buffer(queuec,dens3_buf,dens3,is_blocking=True)
cax_buf=cl.Buffer(ctxc,mf.READ_ONLY,cax.nbytes)
cl.enqueue_write_buffer(queuec,cax_buf,cax,is_blocking=True)
cay_buf=cl.Buffer(ctxc,mf.READ_ONLY,cay.nbytes)
cl.enqueue_write_buffer(queuec,cay_buf,cay,is_blocking=True)
caz_buf=cl.Buffer(ctxc,mf.READ_ONLY,caz.nbytes)
cl.enqueue_write_buffer(queuec,caz_buf,caz,is_blocking=True)
pdose_buf=cl.Buffer(ctxc,mf.READ_WRITE,pdose.nbytes)
cl.enqueue_write_buffer(queuec,pdose_buf,pdose,is_blocking=True)
bxx_buf=cl.Buffer(ctxc,mf.READ_ONLY,bxx.nbytes)
cl.enqueue_write_buffer(queuec,bxx_buf,bxx,is_blocking=True)
byy_buf=cl.Buffer(ctxc,mf.READ_ONLY,byy.nbytes)
cl.enqueue_write_buffer(queuec,byy_buf,byy,is_blocking=True)
dose3_buf=cl.Buffer(ctxc,mf.READ_WRITE,dose3.nbytes)
cl.enqueue_write_buffer(queuec,dose3_buf,dose3,is_blocking=True)
indx1=wher(cax,strt,NCOL);indx2=wher(cax,sstop,NCOL);
indy1=wher(caz,strt,NZ);indy2=wher(caz,sstop,NZ);
nelec=int((indx2-indx1+1))*int((indy2-indy1+1))*int(npcell);
nMP=int(128);nthreads=int(128);
state_buf=cl.Buffer(ctxc,mf.READ_WRITE,nMP*nthreads*112)
print "nelec= %g"%(nelec)
RanInit(queuec,(nMP,nthreads,),None,np.uint32(seedR),state_buf)
nloop=(nelec/(nMP*nthreads))+1;
print "nloop",nloop,nelec,nthreads
ndose3=np.zeros_like(dose3)
totelec=0;
print "cymin= ",cymin
for k in xrange(nloop):
print "kernel loop ",float(k)*100/float(nloop)
#dose3[:]=0.0;pdose[:]=0.0;
seedR=np.uint32(np.random.randint(low=40,high=1e6))
RanInit(queuec,(nMP,nthreads,),None,seedR,state_buf)
cleMC(queuec,(nMP,nthreads,),None,cax_buf,cay_buf,caz_buf,dose3_buf,dens3_buf,pdose_buf,np.int32(nelec),\
np.float32(A),np.float32(alp),np.float32(E0),np.float32(bnc),np.int32(nblk),bxx_buf,byy_buf,np.int32(NROW),np.int32(NCOL),np.int32(NZ),\
np.float32(sstop),np.float32(strt),np.float32(cxmin),np.float32(cxmax),np.float32(cymin),np.float32(cymax),np.float32(czmin),np.float32(czmax),\
np.float32(phper),state_buf)
totelec+=nMP*nthreads
cl.enqueue_copy(queuec,dose3,dose3_buf)
cl.enqueue_copy(queuec,pdose,pdose_buf)
*************
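Since queuec is created with PROFILING_ENABLE, I assume per-kernel and
per-copy timings could be pulled from the returned events, roughly like this
stand-alone sketch (a trivial kernel, not yet wired into the code above):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx,
    "__kernel void scale(__global float *a){ a[get_global_id(0)] *= 2.0f; }").build()

a = np.ones(1024, dtype=np.float32)
a_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR, hostbuf=a)

evt = prg.scale(queue, a.shape, None, a_buf)     # kernel launches return an event
evt.wait()
print "kernel time (ms):", (evt.profile.end - evt.profile.start) * 1e-6

copy_evt = cl.enqueue_copy(queue, a, a_buf)
copy_evt.wait()
print "copy time (ms):  ", (copy_evt.profile.end - copy_evt.profile.start) * 1e-6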
________________________________________
From: PyOpenCL [pyopencl-bounces(a)tiker.net] on behalf of Joe Haywood [haywoojr(a)mercyhealth.com]
Sent: Wednesday, December 18, 2013 10:50 AM
To: pyopencl(a)tiker.net
Subject: Re: [PyOpenCL] openCL your thoughts
I have not done the profiling using OpenCL tools, only the built-in Python timer. I will give this a try shortly and report back.
Someone else asked for a code sample; the main loop is presented below. (If it needs some sort of formatting tags, I don't know how to use them; I apologize in advance :-) Warning: I am a physicist, and I program like one. The code is supposed to move an electron randomly through a material grid and calculate the energy deposited in the grid. Thanks in advance.
Joe
/*************************/
inline void GAtomicAdd(volatile __global float *source, const float operand) {
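/* atomic float add: retry a 32-bit compare-and-swap of the bit pattern at
   'source' until no other work item has modified it between the read and the swap */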
union {
unsigned int intVal;
float floatVal;
} newVal;
union {
unsigned int intVal;
float floatVal;
} prevVal;
do {
prevVal.floatVal = *source;
newVal.floatVal = prevVal.floatVal + operand;
} while (atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal, newVal.intVal) != prevVal.intVal);
}
/*Main calculation engine*/
__kernel void cleMC( __global const float *x, __global const float *y, __global const float *z, __global float * dose, \
__global const float * dens,__global float * pdose, int nelec , \
const float A, \
const float alp, const float En0, const float bnc,\
const int nvert,__global const float *vertx, __global const float *verty, const int NROW,const int NCOL,const int NZ, \
const float sstp,const float strt,const float xmin, const float xmax, const float ymin, \
const float ymax, const float zmin, const float zmax,const float phper,__global ranluxcl_state_t *ranluxcltab){
float minSe=0.01f,e0,delE,rnd;
float4 rlx;
float ex, ey, ez, edx, edy, edz,vssd=90.0f;
float theta,the1,the2,the3,urn,dlo,dl,rn,r1,fac,afac=1.2e-3f,cfac,sfc,sfr,sfy;
float dx,dy,dz;
float vec[3],tdl;
//curandState localstate;
int indx,indy,indz,notin;
int ph=0,lid,i,init;
//random numbers stuff
//ranluxclstate stores the state of the generator.
ranluxcl_state_t ranluxclstate;
//Download state into ranluxclstate struct.
ranluxcl_download_seed(&ranluxclstate, ranluxcltab);
/*based on the initial conditions will choose which direction is the smallest, not necessary as of 10-10-13 since all dimensions are equal from python*/
dx=fabs(x[1]-x[0]);dy=fabs(y[1]-y[0]);dz=fabs(z[1]-z[0]);
if(dx<=dy && dx<=dz){
dlo=dx;
}else if(dy<=dx && dy<=dz){
dlo=dy;
}else if(dz<=dx && dz<=dy){
dlo=dz;
}
/*minimum energy electron can have from the range tables that will just cross the volume, ScsdaSe defined in header file incs2f.h, minSe minimum energy from tables*/
minSe=0.01f;
/*defines the unique thread number using CUDA variables*/
//i=threadIdx.x+blockIdx.x*blockDim.x;
init=get_local_id(0)+get_group_id(0)*get_local_size(0);
lid=get_local_id(0);
/*begin electron loop, generate enough electrons to satisfy the requested number from python*/
/*choose an energy from a normal distribution of spread alp and mean E0*/
rlx=ranluxcl32norm(&ranluxclstate);
e0=rlx.x*alp+En0;
/*notin defines whether the electron is in the grid, or being blocked by the electron cutout*/
notin=1;
/*choose random positions on the grid in the beams eye view*/
rlx=ranluxcl32(&ranluxclstate);
ex=rlx.x*(sstp-strt)-(sstp-strt)/2.0f;
ex+=rlx.y*0.01f-0.005f;
ez=rlx.z*(sstp-strt)-(sstp-strt)/2.0f;
ez+=rlx.w*0.01f-0.005f;
/*see incs2f.h for pnpoly: is the electron in the grid and in the aperture of the block*/
notin=pnpoly(nvert,vertx,verty,ex,ez);
if (notin==0) e0=0.0f;
//ey is ymin but testing to have it at iso
ey=ymin;rn=bnc;dl=rn*dlo;
minSe=ScsdaSe(dl);
if(minSe<0.01f)minSe=0.01f;
urn=sqrt(ex*ex+ez*ez+vssd*vssd);
/*stepsize pseudo velocity vector, used to step the electrons through the grid*/
edy=dl*vssd/urn;edx=dl*ex/urn;edz=dl*ez/urn;
//is this a photon
/*phper from python is the photon percentage needed to match the machine measurements; if the random number is less than the percentage then make this a photon and not an electron*/
rlx=ranluxcl32(&ranluxclstate);
if(rlx.x<phper)ph=1;
while(e0>minSe){
/*if the electron has wandered off the grid exit the loop*/
if(ez>zmax || ez<zmin||ex>xmax||ex<xmin||ey>ymax||ey<ymin)break;
/*find which array index the electron is in*/
indx=d_wher(x,ex,NCOL);indy=d_wher(y,ey,NROW);
indz=d_wher(z,ez,NZ);
//check the density we are currently in
fac=dens[indz+indx*NZ+indy*NZ*NCOL];
/*scale the water stopping powers etc to the current material*/
sfc=scalSeSc(e0,fac);
sfr=scalSeSr(e0,fac);
sfy=scalSeSy(e0,fac);
//scattering first
/*scatter the electron for next iteration*/
tdl=0.0f;
vec[0]=edx;vec[1]=edy;vec[2]=edz;
/*adjust for photon vs electron*/
if(ph==1){cfac=afac;}else{cfac=fac;}
/*CUDA notation for theta=sqrt(fac*dl)*(A*PI/(180))*/
theta=sqrt(fac*dl)*(A*PI/180.0f);
/*for photon or air density the scattering angle is larger by 3.75 times, experimentally determined*/
if(fac==afac){theta*=3.75f;}
rlx=ranluxcl32(&ranluxclstate);
urn=rlx.x*2.0f-1.0f;
/*find the direction with the largest magnitude, scatter around it randomly choosing one of the other two directions*/
if(edx >= edy && edx >= edz){
the2=(theta*urn);
the3=(theta*urn);
urn=rlx.y;
if(urn<=0.5f){
edx=(vec[0]*native_cos(the2)+vec[2]*native_sin(the2));
edz=(vec[2]*native_cos(the2)-vec[0]*native_sin(the2));
}else{
edx=(vec[0]*native_cos(the3)-vec[1]*native_sin(the3));
edy=(vec[1]*native_cos(the3)+vec[0]*native_sin(the3));
}
}else if(edy >= edx && edy>=edz){
the1=theta*urn;
the3=theta*urn;
urn=rlx.y;
if(urn<=0.5f){
edy=(vec[1]*native_cos(the1)-vec[2]*native_sin(the1));
edz=(vec[2]*native_cos(the1)+vec[1]*native_sin(the1));
}else{
edx=(vec[0]*native_cos(the3)-vec[1]*native_sin(the3));
edy=(vec[1]*native_cos(the3)+vec[0]*native_sin(the3));
}
}else if(edz>=edx && edz>=edy){
the2=theta*urn;
the1=theta*urn;
urn=rlx.y;
if(urn<=0.5f){
edy=(vec[1]*native_cos(the1)-vec[2]*native_sin(the1));
edz=(vec[2]*native_cos(the1)+vec[1]*native_sin(the1));
}else{
edx=(vec[0]*native_cos(the2)+vec[2]*native_sin(the2));
edz=(vec[2]*native_cos(the2)-vec[0]*native_sin(the2));
}
}
/*move the electron to the next position*/
ex+=edx;ey+=edy;ez+=edz;
/*total distance moved, CUDA for sqrt(edx*edx+edy*edy+edz*edz);*/
tdl=sqrt(edx*edx+edy*edy+edz*edz);
/*choose radiative interaction or collisional interaction*/
r1=(sfy*SeSy(e0));
rlx=ranluxcl32(&ranluxclstate);
if(rlx.x<r1){
//change in energy by moving through with radiative interaction SeSr
//CUDA for delE=fac*sfr*SeSr(e0)*tdl;
delE=cfac*sfr*SeSr(e0)*tdl;
if(delE>e0)delE=e0;
e0-=delE;
//add the lost energy to the photon dose array
urn=delE/cfac;
//urn*=10000.0f;
GAtomicAdd(&pdose[indz+indx*NZ+indy*NZ*NCOL],urn);
//use fixed point to see if it helps speed, it doesn't so removed
//atomic_add(&pdose[indz+indx*NZ+indy*NZ*NCOL],(int)urn);
if(e0<minSe || delE<0.0f)break;
} else {
//change in energy by moving through with collisional interaction SeSc
//CUDA for delE=sfc*SeSc(e0)*tdl*fac;
delE=sfc*SeSc(e0)*tdl*cfac;
if(delE>e0) delE=e0;
e0-=delE;
//atomic dose calc
//add the lost energy to the dose array
urn=delE/cfac;
//urn*=10000.0f;
GAtomicAdd(&dose[indz+indx*NZ+indy*NZ*NCOL],urn);
//atomic_add(&dose[indz+indx*NZ+indy*NZ*NCOL],(int)urn);
if(e0<minSe || delE<0.0f)break;
}
}
/*save the local random number state for the next call to the kernel*/
ranluxcl_synchronize(&ranluxclstate);
ranluxcl_upload_seed(&ranluxclstate, ranluxcltab);
}
/*************************/
________________________________________
From: Andreas Kloeckner [lists(a)informa.tiker.net]
Sent: Tuesday, December 17, 2013 10:56 PM
To: Joe Haywood
Cc: pyopencl(a)tiker.net
Subject: RE: openCL your thoughts
Dear Joe,
please keep requests such as this to the mailing list. Thanks. I've cc'd
them on my reply.
Joe Haywood <haywoojr(a)mercyhealth.com> writes:
> I was hoping to pick your brain a little more. After rewriting my
> original Python/Cuda/C++ program to Python/PyOpenCL I have done some
> speed comparisons. I cannot get the pyopencl version to run as fast
> as the original. I have an NVidia GT 430 for testing. Running the
> original code, the program takes ~10 seconds to complete. Running the
> opencl version takes ~24 seconds to complete (not including build
> time). Both programs produce the same results, within the uncertainty
> I expect from a Monte Carlo code.
>
> The differences between the two are, the CUDA code uses the CURAND
> libray for random numbers, whereas the OPENCL code uses ranlux from
> pyopencl-ranlux.cl. The CUDA code is compiled as a callable library
> using nvcc with optimizations like -O3 -fast-math -mtune=native etc
> and called in Python using the weave library. The Opencl kernel is
> compiled using the -cl-mad-enable -cl-unsafe-math etc. compile
> options. In the CUDA code I have rewritten some of the functions to
> use the faster math like
> "theta=__fmul_rn(__fsqrt_rn(__fmul_rn(fac,dl)),__fdividef(__fmul_rn(A,PI),180.0f));"
Ranluxcl supports a 'luxury' setting that influences the speed of the
generator. This knob trades off speed against quality of random numbers.
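If I remember right, the level is selected with a preprocessor define at
program build time, along these lines (sketch; the file name is a placeholder):

import pyopencl as cl

ctx = cl.create_some_context()
kernel_src = open("mc_kernel.cl").read()    # placeholder for the actual kernel source
# lower luxury levels run faster at the cost of random-number quality
prg = cl.Program(ctx, kernel_src).build(
    options="-D RANLUXCL_LUX=2 -cl-mad-enable -cl-fast-relaxed-math")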
> I have tried moving enqueue_copy commands around, not reinitializing
> the ranlux generator, etc but I cannot speed up the opencl version
> anymore. Is there something I am missing in Pyopencl that would help
> with this?
Have you tried measuring (using OpenCL event-based profiling) what is
actually taking time?
Andreas
Hi all,
I have been having quite a few issues creating sub-buffers in my code.
The first (minor) issue relates to the convention of buf[start:end]
inheriting the flags of buf when creating the sub-buffer. This is
problematic when buf was created with the COPY_HOST_PTR flag. Such a
flag is not meaningful when creating sub-buffers and so results in an
exception. It would therefore be nice if this flag could be masked out.
Secondly, the __getitem__ handler seems to adopt a non-standard
convention for the slice range, buf[start:size] rather than
buf[start:end]; this causes the following snippet to fail on AMD hardware
(which is strict about buffer lengths):
import pyopencl as cl
ctx = cl.create_some_context()
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, 55296000)
bufslc = buf[36864000:]
See line 1728 of wrap_cl.hpp, which calls get_sub_region(start, end,
my_flags); but on line 1686 we have a prototype of (size_t origin,
size_t size, cl_mem_flags flags), hence the issue. In the snippet above,
end = 55296000 gets passed as the size, so the requested sub-buffer would
span bytes 36864000 through 92160000, well past the end of the
55296000-byte parent buffer.
Given the above two issues I decided to switch to calling get_sub_region
directly. So:
buf[a:a + sz] => buf.get_sub_region(a, sz)
While my code now functions on all platforms, I now get a segfault when
my application terminates:
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x8
[ 0] /lib64/libpthread.so.0(+0x102b0) [0x7fa8cf9b92b0]
[ 1] /usr/lib64/libnvidia-opencl.so.1(+0x24f270) [0x7fa8c0e5e270]
[ 2] /usr/lib64/libnvidia-opencl.so.1(+0x136650) [0x7fa8c0d45650]
[ 3] /usr/lib64/libnvidia-opencl.so.1(+0x128da5) [0x7fa8c0d37da5]
[ 4] /usr/lib64/libnvidia-opencl.so.1(+0x12926b) [0x7fa8c0d3826b]
[ 5] /usr/lib64/libnvidia-opencl.so.1(+0x1345ab) [0x7fa8c0d435ab]
[ 6]
/usr/lib64/python2.7/site-packages/pyopencl/_cl.so(_ZN8pyopencl13memory_object7releaseEv+0x25)
[0x7fa8c2079b95]
[ 7]
/usr/lib64/python2.7/site-packages/pyopencl/_cl.so(_ZN8pyopencl6bufferD0Ev+0x55)
[0x7fa8c2079d35]
[ 8]
/usr/lib64/python2.7/site-packages/pyopencl/_cl.so(_ZN5boost6python7objects14pointer_holderISt8auto_ptrIN8pyopencl6bufferEES5_ED1Ev+0x2e)
[0x7fa8c20751ce]
[ 9] /usr/lib64/libboost_python-2.7.so.1.53.0(+0x265ac) [0x7fa8c1db15ac]
[SNIP]
with the only difference being the above change. (I am stumped here!)
Regards, Freddie.
Hi all,
I have finally bitten the bullet and have started porting my solver from
CUDA to OpenCL. During a time-step it is necessary for MPI ranks to
exchange data. With PyCUDA and mpi4py our application proceeds as follows:
At start-up we allocate a page-locked buffer on the host and an
equally-sized buffer on the device. We also construct a persistent MPI
request for sending the host buffer. Then, when the time is
right, we run a packing kernel on the device, initiate a device-to-host
copy, and then start the persistent MPI request.
Does anyone have any experience with performing this with OpenCL? From
what I can gather there are a variety of options, although none jumps
off the page. I am wary of getting the device to use a
memory-mapped host pointer (when I tried it with CUDA our performance
tanked). I also cannot find a direct equivalent to pagelocked_empty in
OpenCL. ALLOC_HOST_PTR followed by an enqueue_map_buffer may be what I
want, but I am unsure whether it fits in with persistent requests (the
buffer would need to be mapped all of the time).
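For concreteness, this is roughly the pattern I have in mind (untested
sketch; the sizes, the packing kernel, and the neighbour rank are
placeholders), with the ALLOC_HOST_PTR buffer mapped once and kept mapped:

import numpy as np
import pyopencl as cl
from mpi4py import MPI

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

nbytes = 4 * 1024 * 1024                       # placeholder exchange size

# device-side packing buffer (written by a packing kernel each step)
d_pack = cl.Buffer(ctx, mf.READ_WRITE, nbytes)

# host-visible staging buffer: the closest thing to pagelocked_empty I can see
h_stage = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, nbytes)

# map it once at start-up and keep the mapping; the returned array aliases
# the (hopefully pinned) host storage backing h_stage
host_view, ev = cl.enqueue_map_buffer(queue, h_stage,
                                      cl.map_flags.READ | cl.map_flags.WRITE,
                                      0, (nbytes,), np.uint8)
ev.wait()

# persistent send request built on the mapped view (receive side omitted)
comm = MPI.COMM_WORLD
send_req = comm.Send_init(host_view, dest=(comm.rank + 1) % comm.size, tag=0)

# per time step: run the packing kernel into d_pack (omitted), then
cl.enqueue_copy(queue, host_view, d_pack).wait()   # device -> pinned host view
send_req.Start()
send_req.Wait()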
Regards, Freddie.