[PyOpenCL] Learning OpenCL through PyOpenCL

Patric Holmvall patric.hol at gmail.com
Mon Feb 21 12:21:11 PST 2011


Thank you Andreas and Keith!
I think I just have a hard time getting my head around how I should
set up my workgroup size/local size, thread strides and so on, but
I'll figure it out eventually, I guess.

Another topic:
The basic examples cover how to do operations on vectors. I'm trying
to increase the dimensions and want to do operations on a matrix/array
of, let's say, Nx6 elements.

1. Is there anything wrong with the following approach? (syntax? am I
"thinking wrong" when passing arrays like this to the kernel?)
2. How can I split the other dimension between threads and workgroups
in an optimal way (like x[gid_1][gid_2] for example)? Example code
would be very much appreciated.
3. I suspect that I have a syntax error in the OpenCL code when
accessing elements, like x[gid][2] for example. What is the correct
syntax?

(Note that this might not be the best example, as the calculations
could be reduced to vectors, but I will need to learn how to do
similar stuff eventually.)

#### Python ####
N = 10**6
# Allocate an Nx6 array of random numbers (the shape is passed as a tuple)
x = numpy.random.random((N, 6)).astype(numpy.float32)
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
fx_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)
prg.multidim_mc(queue, x.shape, None, x_buf, fx_buf, numpy.uint32(N))

#### OpenCL Kernel ####
prg = cl.Program(ctx, """
    __kernel void multidim_mc(__global const float *x,
                              __global float *fx, uint const N)
    {
      int gid = get_global_id(0);

      if(gid >= N)
      {
        return;
      }
      fx[gid] = (x[gid][0]-x[gid][3])^2 + (x[gid][1]-x[gid][4])^2
              + (x[gid][2]-x[gid][5])^2;
    }
    """).build()

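One way the kernel above could be repaired, offered as a sketch rather
than a definitive answer: a buffer arrives in the kernel as a flat
pointer, so x[gid][j] has to be written as a flat index. Assuming the
host array is in NumPy's default row-major (C) order, element (gid, j)
of the Nx6 array sits at offset 6*gid + j. Also, '^' is bitwise XOR in
C, not a power operator, so the squares are written as products. The
names KERNEL_SRC, NCOLS, flatten_row_major, work_item and run_all are
hypothetical; the kernel is only shown as a string here, and the Python
functions mimic what each work item would compute.

```python
# Hypothetical corrected kernel source (not compiled in this sketch).
KERNEL_SRC = """
__kernel void multidim_mc(__global const float *x,
                          __global float *fx,
                          const uint N)
{
    uint gid = get_global_id(0);
    if (gid >= N)
        return;

    // Row gid of the N x 6 array starts at flat offset 6*gid.
    float dx = x[6*gid + 0] - x[6*gid + 3];
    float dy = x[6*gid + 1] - x[6*gid + 4];
    float dz = x[6*gid + 2] - x[6*gid + 5];
    fx[gid] = dx*dx + dy*dy + dz*dz;  // '^' would be XOR, not a power
}
"""

NCOLS = 6  # columns per row, matching the Nx6 array in the post


def flatten_row_major(rows):
    """Flatten a list of rows the way a C-ordered array is laid out."""
    return [value for row in rows for value in row]


def work_item(x_flat, gid):
    """Pure-Python stand-in for what one work item computes."""
    base = NCOLS * gid
    dx = x_flat[base + 0] - x_flat[base + 3]
    dy = x_flat[base + 1] - x_flat[base + 4]
    dz = x_flat[base + 2] - x_flat[base + 5]
    return dx * dx + dy * dy + dz * dz


def run_all(x_flat, n):
    """Emulate a 1D launch with n work items."""
    return [work_item(x_flat, gid) for gid in range(n)]
```

With this layout the kernel only reads get_global_id(0), so the launch
would presumably use a one-dimensional global size such as (N,) instead
of x.shape; passing x.shape would start N*6 work items and compute each
row six times.
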
Thanks a lot for a great community!
Yours sincerely,
Patric


On Sun, Feb 20, 2011 at 5:40 PM, Keith Brafford
<keith.brafford at gmail.com> wrote:
> Patric Holmvall <patric.hol at ...> writes:
>> Can someone give me
>> pointers on how I find optimum workgroup size, threads and so on for
>> Radeon HD5000 series (HD5850, Stream Processors: 1440) and AMD Phenom
>> II X4 (965BE, 3.4 GHz quadcore) for example 10^9 calculations (10^9
>> dimensional vectors that I operate on)?
>> (Or are those parameters governed by the hardware at all?)
>
> You should join the AMD OpenCL forum for those sorts of questions:
>
> http://forums.amd.com/devforum/categories.cfm?catid=390&entercat=y
>
> My quick answer:
>
> 1) Your 1440 stream processors are really 288 VLIW processors with 5 slots, 4
> for single precision floating point and 1 fancy slot for other stuff:
>
> http://images.bit-tech.net/content_images/2009/09/ati-radeon-hd-5870-architecture-analysis/shadercore.jpg
>
>  So for floats, you really have 1152 processors, if you can make heavy use of
> the float4 data type.  Otherwise you only have 288 processors.
>
> 2) As far as workgroup size goes, your 288 VLIW processors are divided into 18
> "compute units" (the red rectangles that occupy most of the room in this
> picture, which shows 20 since it's the 5870):
>
> http://images.bit-tech.net/content_images/2009/09/ati-radeon-hd-5870-architecture-analysis/flow.jpg
>
> So if you make your workgroup size a multiple of 16 it will fit nicely into the
> architecture.  But remember, if you are able to use the float4 data type, you
> will be able to do 64 pieces of work in 16 work items.
>
> --Keith Brafford
>
>
>
>
> _______________________________________________
> PyOpenCL mailing list
> PyOpenCL at tiker.net
> http://lists.tiker.net/listinfo/pyopencl
>
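
Keith's suggestion to use a workgroup size that is a multiple of 16 can
be combined with the "if (gid >= N) return;" guard in the kernel: when a
local size is given explicitly, OpenCL 1.x requires the global size to
be a multiple of it, so the global size is rounded up and the guard
discards the surplus work items. A minimal sketch, assuming a 1D launch;
round_up and launch_sizes are hypothetical helper names, and the default
of 16 is simply taken from the advice above, not a tuned value.

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def launch_sizes(n_items, local_size=16):
    """Return (global_size, local_size) tuples for a 1D launch.

    The global size is padded up to a multiple of the local size; the
    kernel is expected to ignore work items with gid >= n_items.
    """
    return (round_up(n_items, local_size),), (local_size,)
```

The two tuples would then go where PyOpenCL expects the global and
local size, for example in place of x.shape and None in the call to
prg.multidim_mc.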


