Invalid command Queue when using big data sets on nVidia
by Justin Heinermann

Dear all,
we are trying to implement a K nearest neighbor search on GPUs with
PyOpenCL. The goal of the algorithm is: For a given target point,
find the nearest points from a given set (training data). The distance
between two points is computed by the squared euclidean distance.
One of our implementations is a brute force approach, which aims
at processing big data sets in parallel, e.g. 1 million training data and
some millions of targets (test data). For every target point one kernel
instance is created which finds the k nearest points out of the
training points.
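
For reference, the search described above can be sketched on the host in plain Python. This is not the attached code; the function names are hypothetical and points are assumed to be tuples of floats.

```python
import heapq

def squared_distance(p, q):
    # Squared Euclidean distance between two points of equal dimension.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_nearest(target, training, k):
    # Indices of the k training points closest to target, nearest first.
    return heapq.nsmallest(
        k,
        range(len(training)),
        key=lambda i: squared_distance(target, training[i]),
    )

training = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
nearest = k_nearest((0.9, 0.9), training, 2)
```

A reference like this is handy for checking the kernel output on small inputs before scaling up.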
Our problem is the following. Everything works fine for small data sets
and the results are as expected on both GPU (GeForce GTX 650 with
nVidia driver 313.09) and CPU (Intel Core i5-3450 with AMD APP SDK)
running Ubuntu 12.10, PyOpenCL 2013.1-py2.7-linux-x86_64.
But if we increase the size of the data sets, the GPU version crashes
with the following error:
> File "brutegpu.py", line 65, in query
> cl.enqueue_copy(self.queue, d_min, self.d_min_buf).wait()
> File "/usr/local/lib/python2.7/dist-packages/
> pyopencl-2013.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py",
> line 935, in enqueue_copy
> return _cl._enqueue_read_buffer(queue, src, dest, **kwargs)
> pyopencl.LogicError: clEnqueueReadBuffer failed: invalid command queue
The CPU version still works fine with 1 million training points
and 1 million test points. Attached you can find the corresponding
source code as a working minimal example, which consists of one
host Python file and one OpenCL kernel file.
We would highly appreciate any help - maybe we made a
mistake which is already known to you.
So the big question for us is: Why is it working on CPU and why isn't it
working on the GPU?
Are there nVidia-specific pitfalls for such big data sets?
The compiler says:
> ptxas info : Compiling entry function 'find_knn' for 'sm_30'
> ptxas info : Function properties for find_knn
> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
> ptxas info : Used 17 registers, 336 bytes cmem[0], 4 bytes cmem[3]
Or are there any rules for using a kernel for big data sets such as setting
the work group sizes or maximum memory usage?
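
As a rough aid for the memory question, buffer sizes can be estimated on the host and compared with the limits the device reports (CL_DEVICE_MAX_MEM_ALLOC_SIZE and CL_DEVICE_GLOBAL_MEM_SIZE). The sketch below also splits the targets into batches so that each launch handles fewer points. The dimensionality, the batch size and the helper names are assumptions for illustration, and whether batching avoids the error reported here is not established.

```python
def buffer_bytes(n_points, n_dims, itemsize=4):
    # Bytes needed for n_points points of n_dims float32 coordinates.
    return n_points * n_dims * itemsize

def batches(n_total, batch_size):
    # Yield (start, stop) index ranges that cover n_total items.
    for start in range(0, n_total, batch_size):
        yield start, min(start + batch_size, n_total)

# Hypothetical example: 1 million 3-dimensional training points.
train_bytes = buffer_bytes(1000000, 3)
ranges = list(batches(10, 4))
```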
The error message "invalid command queue" is confusing and I wasn't able
to find any helpful information (except that "invalid command queue"
often means a segfault, but I could not find any wrong array address yet).
Maybe one of you could have a look at our code and find some stupid
mistake.
We would be very grateful for every hint.
Best regards,
Justin Heinermann,
University Oldenburg
5 years, 6 months

Segmentation fault in pyopencl.image_from_array
by Jerome Kieffer

Dear Python/OpenCL community,
I am pretty new to (Py)OpenCL and have encountered a problem. Maybe it is a lack of understanding of OpenCL on my part, but I get strange Python segfaults:
test program:
#!/usr/bin/python
import numpy, pyopencl
ctx = pyopencl.create_some_context()
data=numpy.random.random((1024,1024)).astype(numpy.float32)
img = pyopencl.image_from_array(ctx, ary=data, mode="r", norm_int=False, num_channels=1)
print img
System: debian sid: pyopencl2012.1 (the same code works with debian stable and v2011.2)
Here is the backtrace obtained with GDB:
0x0000000000000000 in ?? ()
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff340c253 in pyopencl::create_image_from_desc(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#2 0x00007ffff342de36 in _object* boost::python::detail::invoke<boost::python::detail::install_holder<pyopencl::image*>, pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>, boost::python::arg_from_python<unsigned long>, boost::python::arg_from_python<_cl_image_format const&>, boost::python::arg_from_python<_cl_image_desc&>, boost::python::arg_from_python<boost::python::api::object> >(boost::python::detail::invoke_tag_<false, false>, boost::python::detail::install_holder<pyopencl::image*> const&, pyopencl::image* (*&)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::arg_from_python<pyopencl::context const&>&, boost::python::arg_from_python<unsigned long>&, boost::python::arg_from_python<_cl_image_format const&>&, boost::python::arg_from_python<_cl_image_desc&>&, boost::python::arg_from_python<boost::python::api::object>&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#3 0x00007ffff342e06f in boost::python::detail::caller_arity<5u>::impl<pyopencl::image* (*)(pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object), boost::python::detail::constructor_policy<boost::python::default_call_policies>, boost::mpl::vector6<pyopencl::image*, pyopencl::context const&, unsigned long, _cl_image_format const&, _cl_image_desc&, boost::python::api::object> >::operator()(_object*, _object*) ()
from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#4 0x00007ffff311715b in boost::python::objects::function::call(_object*, _object*) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#5 0x00007ffff3117378 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
#6 0x00007ffff3120593 in boost::python::detail::exception_handler::operator()(boost::function0<void> const&) const ()
from /usr/lib/libboost_python-py27.so.1.49.0
#7 0x00007ffff3445983 in boost::detail::function::function_obj_invoker2<boost::_bi::bind_t<bool, boost::python::detail::translate_exception<pyopencl::error, void (*)(pyopencl::error const&)>, boost::_bi::list3<boost::arg<1>, boost::arg<2>, boost::_bi::value<void (*)(pyopencl::error const&)> > >, bool, boost::python::detail::exception_handler const&, boost::function0<void> const&>::invoke(boost::detail::function::function_buffer&, boost::python::detail::exception_handler const&, boost::function0<void> const&) () from /usr/lib/python2.7/dist-packages/pyopencl/_cl.so
#8 0x00007ffff3120373 in boost::python::handle_exception_impl(boost::function0<void>) ()
from /usr/lib/libboost_python-py27.so.1.49.0
#9 0x00007ffff3115635 in ?? () from /usr/lib/libboost_python-py27.so.1.49.0
Thanks for your help.
If you are not able to reproduce this bug, I should mention it to debian.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
6 years

Re: [PyOpenCL] Windows 8 and PyOpenCL missing pytools
by Andreas Kloeckner

marcus.desto <marcus.desto(a)o2.pl> writes:
> Hello Andreas,
>
> thanks for your mail.
>
> First of all I would like to say that I had subscribed to your pyopencl mailing list a few days before I mailed you the first time. But somehow the mailing list did not work.
>
> However.
>
> On 11 July 2013 at 23:27, Andreas Kloeckner <lists(a)informa.tiker.net> wrote:
>
>>
>> Wrote a howto for you, since I agree that that area of PyOpenCL is
>> somewhat underdocumented:
>>
>> http://documen.tician.de/pyopencl/howto.html
>
> Thank you a lot. It helped me a lot.
>
> Now, sticking to the subject, I have installed python 2.7 and 3.3 on my testing windows 8 system using external sources for python addons (ifd.uci.edu/~gohlke/pythonlibs).
>
> Installation was fine, but when I tried to import pyopencl, the Python shell complained that there is no module named pytools (ImportError). It occurred in pyopencl\__init__.py, lines 679 and 448.
>
> Is there a solution to this problem?
Download pytools from here:
https://pypi.python.org/pypi/pytools
unpack, cd pytools-2013.5.3, python setup.py install.
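
After installing, a quick way to confirm that the module is visible to the interpreter is sketched below. It assumes Python 3.4 or later (for importlib.util.find_spec); the helper name is hypothetical.

```python
import importlib.util

def module_available(name):
    # True if a top-level module of this name can be located for import.
    return importlib.util.find_spec(name) is not None

if __name__ == "__main__":
    for name in ("pytools", "pyopencl"):
        print(name, "found" if module_available(name) else "MISSING")
```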
HTH,
Andreas
6 years, 5 months

clmathlibraries
by Sam

Hello,
I was wondering if anyone has looked at the clBLAS and clFFT libraries that
AMD recently open sourced? Would it be possible to have these inside pyopencl
or should they be a separate wrapper on top of it like pyfft? Also, are they
vendor agnostic? The reason I ask is that I suck at OpenCL but am okay at
Python, and it would be awesome to be able to do cl.math.matrix.multiply(A,B)
and have it work. pyopencl is awesome, but it is still difficult to do work like
training a neural net with it. Currently most Python GPU neural net libraries are
written in/for CUDA (theano, cudamat, etc.); this is probably because of the
lack of a good BLAS for OpenCL. So is clBLAS a candidate for a Python wrapper?
github.com/clMathLibraries
6 years, 6 months

PyOpenCL on PyPy
by Matti Picus

Hi.
I am interested in getting PyOpenCL to work with PyPy, an
implementation of Python with a JIT (www.pypy.org). Has there been any
discussion or thought about doing this? PyPy has a basic
implementation of numpy called numpypy that I contribute to, and it
has a rudimentary numpy-compatible c interface available as an
external module at
https://bitbucket.org/antocuni/numpypy_c
The PyPy team has a cpython-compatible replacement for ctypes called
cffi, that is jit-friendly on PyPy and no slower than ctypes on
cpython.
So it seems like all the pieces exist to start, is anyone else interested in
getting the work done?
Or are there blocking issues I do not understand?
Matti
6 years, 6 months

pyOpenCL advice needed
by Pongsak Suvanpong

Dear All
I have code for multivariable linear regression in Python using numpy, shown below. I am trying to learn PyOpenCL by writing the same algorithm in OpenCL (code shown below). Somehow, no matter what I try (even using very large inputs, 200 features and 5000 samples), the OpenCL code always runs slower than the serial code using numpy. I wonder if anyone could point out what I have done wrong. My setup is Mac OS X 10.8.4, video card NVIDIA GTX650, python3, numpy 1.7.1, and pyopencl checked out from GitHub.
"""
Multivariable Linear regression
by psksvp(a)gmail.com
based on Andrew Ng lecture (https://class.coursera.org/ml-003/lecture/index)
"""
import numpy as np

"""
Multivariables Linear Regression,
Currently there is no bias term.
"""
class MultivariablesLinearRegressionSerial(object):
    def __init__(self, numberOfFeatures):
        self.parameterVector = np.array(np.random.random(numberOfFeatures))

    def recall(self, inputVector):
        return np.dot(inputVector, self.parameterVector)

    def error(self, inputVector, target):
        return self.recall(inputVector) - target

    def learn(self, sampleMatrix, targetVector, learningRate=0.05, convergeAtError=0.001):
        #if(len(sampleMatrix) != 2) #error checking
        mSamples = sampleMatrix.shape[0] ## number of samples
        #nFeatures = sampleMatrix.shape[1] ## number of features
        howmuch = (learningRate * (1.0 / float(mSamples)))
        converge = False
        mse = 0.0
        count = 0
        while not converge:
            output = np.dot(sampleMatrix, self.parameterVector)
            #print(sampleMatrix)
            #print(self.parameterVector)
            #print(output)
            error = output - targetVector
            gradient = np.dot(error, sampleMatrix)
            self.parameterVector = self.parameterVector - (gradient * howmuch)
            error = np.dot(sampleMatrix, self.parameterVector) - targetVector
            mse = np.dot(error, error)
            print(mse)
            if(mse <= convergeAtError):
                converge = True
            else:
                count = count + 1
        return count
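
For readers following the update rule, one iteration of the loop above can be written out in plain Python as a sketch. It mirrors the numpy expressions (output, error, gradient, parameter update) with lists instead of arrays; the function name is hypothetical.

```python
def gradient_step(samples, targets, params, learning_rate=0.05):
    # One gradient-descent update for linear regression without a bias term.
    m = len(samples)
    # output = samples . params
    outputs = [sum(x * p for x, p in zip(row, params)) for row in samples]
    # error = output - target
    errors = [o - t for o, t in zip(outputs, targets)]
    # gradient[j] = sum over samples of error * feature j
    gradient = [
        sum(e * row[j] for e, row in zip(errors, samples))
        for j in range(len(params))
    ]
    how_much = learning_rate / m
    return [p - g * how_much for p, g in zip(params, gradient)]

# Two samples, one feature, targets generated by y = 2 * x.
new_params = gradient_step([[1.0], [2.0]], [2.0, 4.0], [0.0], learning_rate=0.1)
```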
#######################################################
import numpy as np
#import pyopencl as ocl
import pyopencl.array as oclArray
from psksvp import MathOCL
###################################################################################
#### openCL
##################################################################################
class MultivariablesLinearRegressionOpenCL(object):
    def __init__(self, numberOfFeatures, openCLContext, queue):
        self.parameterVector = np.array(np.random.random(numberOfFeatures), dtype=np.float32)
        #self.parameterVector = np.zeros(shape=numberOfFeatures, dtype=np.float32)
        self.openCLContext = openCLContext
        self.queue = queue

    def recall(self, inputVector):
        return np.dot(inputVector, self.parameterVector)

    def error(self, inputVector, target):
        return self.recall(inputVector) - target

    def learn(self, sampleMatrix, targetVector, learningRate=0.05, convergeAtError=0.001):
        #if(len(sampleMatrix) != 2) #error checking
        mSamples = sampleMatrix.shape[0] ## number of samples
        nFeatures = sampleMatrix.shape[1] ## number of features
        howMuch = (learningRate * (1.0 / float(mSamples)))
        converge = False
        #mse = 0.0
        deviceParameterVector = oclArray.to_device(self.queue, self.parameterVector.T)
        deviceTargetVector = oclArray.to_device(self.queue, targetVector)
        deviceSampleMatrix = oclArray.to_device(self.queue, sampleMatrix)
        deviceGradient = oclArray.zeros(self.queue, shape=nFeatures,
                                        dtype=np.float32)
        deviceError = oclArray.zeros(self.queue, shape=mSamples,
                                     dtype=np.float32)
        deviceOutput = oclArray.zeros(self.queue, shape=mSamples,
                                      dtype=np.float32)
        #deviceConvergeAtError = oclArray.to_device(self.queue, np.array(convergeAtError))
        outputCalculator = MathOCL.MatrixMultipleKernel(AWidth=nFeatures, AHeight=mSamples,
                                                        BWidth=1, BHeight=nFeatures,
                                                        OpenCLContext=self.openCLContext,
                                                        queue=self.queue)
        gradientCalculator = MathOCL.MatrixMultipleKernel(AWidth=mSamples, AHeight=1,
                                                          BWidth=nFeatures, BHeight=mSamples,
                                                          OpenCLContext=self.openCLContext,
                                                          queue=self.queue)
        vectorToVectorSubtractor = MathOCL.VectorToVectorOperatorKernel("-", OpenCLContext=self.openCLContext,
                                                                        queue=self.queue)
        from pyopencl.elementwise import ElementwiseKernel
        parameterUpdateKernel = ElementwiseKernel(context = self.openCLContext,
                                                  arguments = "float howMuch, float *gradient, float* parameters",
                                                  operation = "parameters[i] = parameters[i] - (gradient[i] * howMuch)",
                                                  name="parameterUpdated")
        count = 0
        while not converge:
            #output = np.dot(sampleMatrix, self.parameterVector)
            outputCalculator.multiply(deviceSampleMatrix.data, deviceParameterVector.data, deviceOutput.data)
            #error = output - targetVector
            vectorToVectorSubtractor.do(deviceOutput.data, deviceTargetVector.data, deviceError.data, deviceError.shape)
            #gradient = np.dot(error, sampleMatrix)
            gradientCalculator.multiply(deviceError.data, deviceSampleMatrix.data, deviceGradient.data)
            #self.parameterVector = self.parameterVector - (gradient * howMuch)
            parameterUpdateKernel(howMuch, deviceGradient, deviceParameterVector)
            #error = np.dot(sampleMatrix, self.parameterVector) - targetVector
            #output = np.dot(sampleMatrix, self.parameterVector)
            #outputCalculator.multiply(deviceSampleMatrix.data, deviceParameterVector.data, deviceOutput.data)
            #error = output - targetVector
            #vectorToVectorSubtractor.do(deviceOutput.data, deviceTargetVector.data, deviceError.data, deviceError.shape)
            mse = oclArray.dot(deviceError, deviceError)
            print(mse)
            if(mse <= convergeAtError):
                self.parameterVector = deviceParameterVector.get(self.queue)
                converge = True
            else:
                count = count + 1
        return count
///////////
"""
Math routines for OpenCL. by psksvp(a)gmail.com
Matrix Multiplication is based on example provided by pyopencl package.
"""
from __future__ import division
matrixMultiplyKernelCode = """
// Thread block size
#define BLOCK_SIZE %(block_size)d
// Matrix dimensions
// (chosen as multiples of the thread block size for simplicity)
#define WA %(w_a)d // Matrix A width
#define HA %(h_a)d // Matrix A height
#define WB %(w_b)d // Matrix B width
#define HB WA // Matrix B height
#define WC WB // Matrix C width
#define HC HA // Matrix C height
/*
* Copyright 1993-2009 NVIDIA Corporation. All rights reserved.
*
* NVIDIA Corporation and its licensors retain all intellectual property and
* proprietary rights in and to this software and related documentation.
* Any use, reproduction, disclosure, or distribution of this software
* and related documentation without an express license agreement from
* NVIDIA Corporation is strictly prohibited.
*
* Please refer to the applicable NVIDIA end user license agreement (EULA)
* associated with this source code for terms and conditions that govern
* your use of this NVIDIA software.
*
*/
/* Matrix multiplication: C = A * B.
* Device code.
*/
#define AS(j, i) As[i + j * BLOCK_SIZE]
#define BS(j, i) Bs[i + j * BLOCK_SIZE]
////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! WA is A's width and WB is B's width
////////////////////////////////////////////////////////////////////////////////
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE,BLOCK_SIZE,1)))
void matrixMultiple( __global float* C, __global float* A, __global float* B)
{
    __local float As[BLOCK_SIZE*BLOCK_SIZE];
    __local float Bs[BLOCK_SIZE*BLOCK_SIZE];

    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);

    // Thread index
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    // Index of the first sub-matrix of A processed by the block
    int aBegin = WA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + WA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * WB;

    // Csub is used to store the element of the block sub-matrix
    // that is computed by the thread
    float Csub = 0.0f;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin;
         a <= aEnd;
         a += aStep, b += bStep) {
        // Load the matrices from device memory
        // to shared memory; each thread loads
        // one element of each matrix
        AS(ty, tx) = A[a + WA * ty + tx];
        BS(ty, tx) = B[b + WB * ty + tx];

        // Synchronize to make sure the matrices are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = Csub;
}
"""
#######################################
vectorToVectorKernelCode = """
__kernel void vectorToVectorOp(__global const float *a, __global const float *b, __global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] %(OPERATOR)s b[gid];
}
"""
import pyopencl as ocl
import pyopencl.array as oclArray

def makeMatrixMultipleKernel(AWidth, AHeight, BWidth, BHeight, OpenCLContext, queue, blockSize=1):
    kernelParams = {"block_size": blockSize, "w_a":AWidth, "h_a":AHeight, "w_b":BWidth}
    if "NVIDIA" in queue.device.vendor:
        options = "-cl-mad-enable -cl-fast-relaxed-math"
    else:
        options = ""
    kernel = ocl.Program(OpenCLContext, matrixMultiplyKernelCode % kernelParams,).build(options=options)
    return kernel
## A * B = C
##c_width = b_width
##c_height = a_height
class MatrixMultipleKernel(object):
    def __init__(self, AWidth, AHeight, BWidth, BHeight, OpenCLContext, queue, blockSize=1):
        self.resultDim = (BWidth, AHeight)
        self.blockSize = blockSize
        self.queue = queue
        self.kernel = makeMatrixMultipleKernel(AWidth, AHeight, BWidth, BHeight, OpenCLContext, queue, blockSize)

    def multiply(self, deviceMatrix1, deviceMatrix2, deviceMatrixResult):
        self.kernel.matrixMultiple(self.queue, self.resultDim ,
                                   (self.blockSize, self.blockSize),
                                   deviceMatrixResult,
                                   deviceMatrix1,
                                   deviceMatrix2).wait()

class VectorToVectorOperatorKernel(object):
    def __init__(self, operator, OpenCLContext, queue):
        self.OpenCLContext = OpenCLContext
        self.queue = queue
        self.kernel = ocl.Program(OpenCLContext,
                                  vectorToVectorKernelCode % {"OPERATOR": operator}).build()

    def do(self, deviceVector1, deviceVector2, deviceResult, shape):
        self.kernel.vectorToVectorOp(self.queue,
                                     shape,
                                     None,
                                     deviceVector1, deviceVector2, deviceResult).wait()

class VectorToVectorSubtractKernel(VectorToVectorOperatorKernel):
    def __init__(self, OpenCLContext, queue):
        VectorToVectorOperatorKernel.__init__(self, "-", OpenCLContext, queue)
#######################################################
def test():
    import numpy
    matrix1 = numpy.array([[ 7, 3],
                           [ 2, 5],
                           [ 6, 8],
                           [ 9, 0]], dtype=numpy.float32)
    matrix2 = numpy.array([[ 7, 4, 9],
                           [ 8, 1, 5]], dtype=numpy.float32)
    ctx = ocl.create_some_context()
    queue = ocl.CommandQueue(ctx)
    matrix3 = numpy.array([[0, 0, 0],
                           [0, 0, 0],
                           [0, 0, 0],
                           [0, 0, 0]], dtype=numpy.float32)
    dA = oclArray.to_device(queue, matrix1)
    dB = oclArray.to_device(queue, matrix2)
    dC = oclArray.to_device(queue, matrix3)
    #deviceA = ocl.Buffer(ctx, ocl.mem_flags.READ_ONLY | ocl.mem_flags.COPY_HOST_PTR, hostbuf=matrix1)
    #deviceB = ocl.Buffer(ctx, ocl.mem_flags.READ_ONLY | ocl.mem_flags.COPY_HOST_PTR, hostbuf=matrix2)
    #deviceC = ocl.Buffer(ctx, ocl.mem_flags.WRITE_ONLY, size=matrix3.nbytes)
    print(numpy.dot(matrix1, matrix2))
    mK = MatrixMultipleKernel(AWidth=matrix1.shape[1],
                              AHeight=matrix1.shape[0],
                              BWidth=matrix2.shape[1],
                              BHeight=matrix2.shape[0],
                              OpenCLContext=ctx, queue=queue)
    mK.multiply(dA.data, dB.data, dC.data)
    ocl.enqueue_copy(queue, matrix3, dC.data)
    print(matrix3)

if __name__ == "__main__":
    test()
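
One observation on the listing above: MatrixMultipleKernel defaults to blockSize=1 and the regression class never passes another value, so the kernel is launched with a local size of (1, 1). The sketch below only works out the resulting launch geometry in plain Python; the helper name is hypothetical, and whether this accounts for the slowdown is an assumption that would need profiling to confirm.

```python
def work_group_layout(global_size, block_size):
    # Number of work-groups and work-items per group for a 2-D launch
    # with local size (block_size, block_size).
    groups = 1
    for extent in global_size:
        if extent % block_size != 0:
            raise ValueError(
                "global size %r is not a multiple of block size %d"
                % (global_size, block_size)
            )
        groups *= extent // block_size
    return groups, block_size * block_size

# Matrix-vector product from the post: result is 1 wide, 5000 samples high.
layout = work_group_layout((1, 5000), 1)
```

With a result that is only one element wide, a larger block size such as 16 does not divide the global size at all, which suggests the tiled matrix-matrix kernel is an awkward fit for a matrix-vector product.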
6 years, 6 months

Re: [PyOpenCL] demo_mandelbrot.py: poor serial algorithm
by Andreas Kloeckner

"CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
> Hi,
> in pyopencl 2013.1, I've been looking into examples/demo_mandelbrot.py.
>
> The serial implementation of the algorithm is a brain-off unloop of the
> numpy implementation, which really does poor pure python no justice at all.
>
>
> The original implementation:
>
> def calc_fractal_serial(q, maxiter):
>     # calculate z using numpy
>     # this routine unrolls calc_fractal_numpy as an intermediate
>     # step to the creation of calc_fractal_opencl
>     # it runs slower than calc_fractal_numpy
>     z = np.zeros(q.shape, np.complex64)
>     output = np.resize(np.array(0,), q.shape)
>     for i in range(len(q)):
>         for iter in range(maxiter):
>             z[i] = z[i]*z[i] + q[i]
>             if abs(z[i]) > 2.0:
>                 q[i] = 0+0j
>                 z[i] = 0+0j
>                 output[i] = iter
>     return output
>
>
> This doesn't take into account that
> 1) it uses an iterative approach, yet it doesn't take the opportunity to
> break out of the inner for cycle when it can!
> 2) python is NOT optimized for non-native data types such as np.complex64.
>
> My version:
>
> def calc_fractal_serial(q, maxiter):
>     # calculate z using pure python on a numpy array
>     # note that, unlike the other two implementations,
>     # the number of iterations per point is NOT constant
>     z = np.zeros(q.shape, complex)
>     output = np.resize(np.array(0,), q.shape)
>     for i in range(len(q)):
>         for iter in range(maxiter):
>             z[i] = z[i]*z[i] + q[i]
>             if abs(z[i]) > 2.0:
>                 output[i] = iter
>                 break
>     return output
>
>
> Time to process a 256x256 image:
> Original implementation: 12.56s
> After introducing the break in the loop: 4.44s
> After changing the data type from np.complex64 to complex: 1.54s
Thanks for the improvement--changed in git.
Andreas
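
The per-point logic of the improved version can be isolated as a small sketch using Python's built-in complex type. The function name is hypothetical; like the code quoted above, it reports 0 for points that never escape.

```python
def escape_iteration(c, maxiter):
    # Iterate z -> z*z + c and return the first iteration at which |z| > 2,
    # stopping early instead of running all maxiter iterations.
    z = 0j
    for it in range(maxiter):
        z = z * z + c
        if abs(z) > 2.0:
            return it
    return 0  # never escaped within maxiter iterations

inside = escape_iteration(0j, 50)
outside = escape_iteration(1 + 0j, 50)
```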
6 years, 6 months

pyopencl matrix-multiply.py problem
by Pongsak Suvanpong

Dear all
I am trying to run the example program matrix-multiply.py in the examples directory of pyopencl, but I get the error below:
pyopencl.LogicError: clEnqueueNDRangeKernel failed: invalid work group size
Reading this message
http://lists.tiker.net/pipermail/pyopencl/2011-August/000867.html
it seems to be a similar problem, but I am not sure how to fix it.
My setup is Python 3.3, numpy 1.7.x, OS X 10.8.4 (the system reports OpenCL 1.2), and pyopencl cloned from GitHub.
psksvp
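
Under OpenCL 1.x, an "invalid work group size" error is returned when the global size is not a multiple of the local size, when a local dimension exceeds the device's per-dimension limit (CL_DEVICE_MAX_WORK_ITEM_SIZES), or when the group holds more work-items than the device or kernel allows. The checks can be sketched on the host as follows; the helper name and the example limits are hypothetical, not values read from the poster's machine.

```python
def check_work_group(global_size, local_size, max_group_size, max_item_sizes):
    # Return a list of reasons why this launch geometry would be rejected.
    problems = []
    if len(global_size) != len(local_size):
        return ["global and local sizes have different dimensions"]
    total = 1
    for dim, (g, l) in enumerate(zip(global_size, local_size)):
        if l <= 0 or g % l != 0:
            problems.append("dim %d: %d is not a multiple of %d" % (dim, g, l))
        if l > max_item_sizes[dim]:
            problems.append(
                "dim %d: local size %d exceeds limit %d" % (dim, l, max_item_sizes[dim])
            )
        total *= l
    if total > max_group_size:
        problems.append("group of %d work-items exceeds limit %d" % (total, max_group_size))
    return problems

# A 16x16 group on a device that permits it, and on one limited to 1-D groups.
ok = check_work_group((64, 64), (16, 16), 1024, (1024, 1024, 64))
bad = check_work_group((64, 64), (16, 16), 1024, (1024, 1, 1))
```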
6 years, 6 months

Re: [PyOpenCL] help with reductionKernel
by Pongsak Suvanpong

Hi Bogdan
numpy.version.version reports 1.8.0.dev-Unknow. I am running on python3.3 and pyopencl checked out from github.
thanks for any help
psksvp
On 15/08/2013, at 12:08 AM, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hi Pongsak,
>
> What version of numpy do you have? This code works normally for me on
> the same configuration + numpy 1.7.1.
>
> On Wed, Aug 14, 2013 at 11:42 PM, Pongsak Suvanpong <psksvp(a)gmail.com> wrote:
>> Dear all
>>
>> I am trying to run the following example from the documentation but get the error shown below. I wonder what I have done wrong.
>>
>> I am running pyopencl on python3.3 osx 10.8.4
>>
>> thanks for any help
>>
>> import numpy
>> import pyopencl as cl
>> import pyopencl.array as clarray
>> from pyopencl.reduction import ReductionKernel
>>
>> ctx = cl.create_some_context()
>> queue = cl.CommandQueue(ctx)
>> krnl = ReductionKernel(ctx, numpy.float32, neutral="0",
>> reduce_expr="a+b", map_expr="x[i]*y[i]",
>> arguments="__global float *x, __global float *y")
>> x = clarray.arange(queue, 400, dtype=numpy.float32)
>> y = clarray.arange(queue, 400, dtype=numpy.float32)
>> m = krnl(x, y).get()
>>
>>
>>
>> ---------------------
>> /Users/psksvp/Local/Library/Python.framework/Versions/3.3/lib/python3.3/site-packages/pyopencl-2013.2-py3.3-macosx-10.8-x86_64.egg/pyopencl/__init__.py:61: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
>> "to see more.", CompilerWarning)
>> Traceback (most recent call last):
>> File "p.py", line 13, in <module>
>> m = krnl(x, y).get()
>> File "/Users/psksvp/Local/Library/Python.framework/Versions/3.3/lib/python3.3/site-packages/pyopencl-2013.2-py3.3-macosx-10.8-x86_64.egg/pyopencl/array.py", line 617, in get
>> ary = _as_strided(ary, strides=self.strides)
>> File "/Users/psksvp/Local/Library/Python.framework/Versions/3.3/lib/python3.3/site-packages/pyopencl-2013.2-py3.3-macosx-10.8-x86_64.egg/pyopencl/compyte/array.py", line 170, in as_strided
>> return np.asarray(_DummyArray(interface, base=x))
>> File "/Users/psksvp/Local/Library/Python.framework/Versions/3.3/lib/python3.3/site-packages/numpy/core/numeric.py", line 325, in asarray
>> return array(a, dtype, copy=False, order=order)
>> ValueError: __array_interface__ shape must be at least size 1
>> _______________________________________________
>> PyOpenCL mailing list
>> PyOpenCL(a)tiker.net
>> http://lists.tiker.net/listinfo/pyopencl
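
For comparison, the value the quoted ReductionKernel is meant to produce (map_expr "x[i]*y[i]", reduce_expr "a+b", neutral "0") can be computed on the host. This is a plain-Python sketch with a hypothetical helper name; a float32 result from the device may differ slightly because of rounding.

```python
from functools import reduce

def map_reduce(x, y, neutral=0.0):
    # Map each pair to x[i]*y[i], then fold the products with a+b.
    mapped = (xi * yi for xi, yi in zip(x, y))
    return reduce(lambda a, b: a + b, mapped, neutral)

x = [float(i) for i in range(400)]
y = [float(i) for i in range(400)]
expected = map_reduce(x, y)
```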
6 years, 6 months

Re: [PyOpenCL] Good performance from a hacked-together GEMM?
by James Bergstra

Hi Bogdan, thanks for the plug. Part of writing was to find a home for
ongoing development. I'll check out Reikna.
On Wed, Aug 14, 2013 at 10:17 AM, Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> Hi James,
>
> My pet project, Reikna (http://reikna.publicfields.net/) has a simple
> implementation of matrix multiplication (mostly lifted from the
> corresponding example in nVidia SDK). More specifically, see
> https://github.com/Manticore/reikna/blob/develop/reikna/matrixmul.mako
>
> Now I understand that it is a shameless self-advertisement, but one of
> the main ideas of Reikna is to provide a framework for implementing
> GPGPU algorithms. I will appreciate it if you had a look at Reikna and
> consider the idea of implementing your GEMM with it.
>
> Best regards,
> Bogdan
>
> On Wed, Aug 14, 2013 at 11:58 PM, James Bergstra
> <james.bergstra(a)gmail.com> wrote:
> > Hi list, sort of writing a show-and-tell post here.
> >
> > Yesterday I hacked together a GEMM code generator using mako and PyOpenCL
> > and by including some of the tips from [1] I made a *CPU* GEMM
> > implementation that gets up around 35 GFLOP/s in single precision on my
> > core i7-3770 (using the intel OCL compiler). On another machine with an Intel
> > i7-960 and the AMD compiler it gets around 8.5 GFLOP/s with the same
> > settings that I used on the 3770. Proper auto-tuning would presumably
> > raise these rates to some extent.
> >
> > The code I used is here:
> >
> https://github.com/jaberg/python-benchmarks/blob/gemm_ocl/gemm/gemm_pyope...
> >
> > Basically I'm writing because I looked around a little and didn't find
> > PyOpenCL source code for GEMM. If I failed to find a more mature project,
> > could someone let me know? Also, if anyone has tips to make it work
> > better, patches welcome. The kernel doesn't prefetch anything, I'm slightly
> > suspicious of the clunkiness associated with "transposing" the little
> > block of A that's stored in vector registers, and I haven't yet plugged it
> > into any kind of profiler.
> >
> > More obviously, the `vectorized_text` code generator has some bugs in it,
> > but seems to work with some combinations of block values (e.g. 4, 4, 4
> > and 8, 8, 8).
> >
> > [1]
> > Kazuya Matsumoto, Naohito Nakasato, Stanislav G. Sedukhin
> > Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on
> > the GPU
> > ftp://ftp.u-aizu.ac.jp/u-aizu/doc/Tech-Report/2012/2012-002.pdf
> >
> >
>
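
GFLOP/s figures for GEMM like those quoted above are usually derived by counting 2*M*N*K floating-point operations for an (M x K) by (K x N) product. A sketch of that arithmetic follows; the helper name is hypothetical, and it is an assumption that the benchmark in the post uses the same convention.

```python
def gemm_gflops(m, n, k, seconds):
    # Each of the m*n outputs needs k multiplies and k adds: 2*m*n*k flops.
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9

# Example: a 1000 x 1000 x 1000 product that takes 2 seconds.
rate = gemm_gflops(1000, 1000, 1000, 2.0)
```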
6 years, 6 months