On Oct 28, CRV§ADER//KY modulated:
> ... Individual instruments
> of the same type may have more or fewer sample points for their curve
OK, I understand that it would be awkward to pack multiple
instruments' worth of attributes into one NDarray. So you'll need at
least 20K OpenCL jobs, each of which evaluates 2M scenarios. The job
will parallelize scenarios for the same instrument.
This is a very different operating regime than I am familiar with. I
might run about 5 jobs to process one whole image, or perhaps 50-100
if I am using sub-block decomposition (repeating the 5 jobs in a
Python loop with different NDarrays as inputs).
>> Can the instrument attributes fit in local device memory?
> Yes, as I said it's 50MB after compressing, more like 300-500MB
> I can optimize though.
I don't mean node memory in a cluster, but OpenCL "local memory" that
is shared by compute units in a work group. So, the question is about
the attributes for a single instrument. Ideally, you'd want one
leader to fetch the attributes from OpenCL global memory into local
memory, and then let all work group members reuse these attributes.
The scenario values will be distinct for each work item, so those can
be fetched directly from global memory by each work item. Hopefully
you can vectorize these global loads, whether explicitly in your
kernel or just by auto-vectorization in the OpenCL compiler.
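The leader-fetch pattern described above might look roughly like the following. This is only a sketch under stated assumptions: the kernel name `value_instrument`, the polynomial "pricing" formula, and the single risk-factor input are hypothetical placeholders, the device is assumed to support double precision, and the attributes are assumed small enough to fit in local memory.

```python
KERNEL_SRC = """
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void value_instrument(
    __global const double *attrs,     // attributes of ONE instrument
    const int n_attrs,
    __global const double *scenario,  // one risk-factor value per scenario
    __global double *out,             // one value per scenario
    __local double *attrs_local,      // scratch shared by the work group
    const int n_scenarios)
{
    int gid = get_global_id(0);

    // One leader per work group copies the attributes to local memory.
    if (get_local_id(0) == 0) {
        for (int i = 0; i < n_attrs; ++i)
            attrs_local[i] = attrs[i];
    }
    barrier(CLK_LOCAL_MEM_FENCE);  // every work item reaches this barrier

    if (gid >= n_scenarios)
        return;  // padding work items do nothing

    // Placeholder formula (NOT a real pricing model): evaluate a polynomial
    // with the attributes as coefficients, using Horner's rule.
    double x = scenario[gid];
    double acc = 0.0;
    for (int i = n_attrs - 1; i >= 0; --i)
        acc = acc * x + attrs_local[i];
    out[gid] = acc;
}
"""


def value_instrument(attrs, scenario, work_group_size=64):
    """Run the sketch kernel for one instrument over all scenarios."""
    import numpy as np
    import pyopencl as cl

    attrs = np.ascontiguousarray(attrs, dtype=np.float64)
    scenario = np.ascontiguousarray(scenario, dtype=np.float64)
    out = np.empty_like(scenario)

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    attrs_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=attrs)
    scen_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=scenario)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

    prg = cl.Program(ctx, KERNEL_SRC).build()
    n = scenario.size
    # Round the global size up to a multiple of the work group size.
    global_size = -(-n // work_group_size) * work_group_size
    prg.value_instrument(
        queue, (global_size,), (work_group_size,),
        attrs_buf, np.int32(attrs.size), scen_buf, out_buf,
        cl.LocalMemory(attrs.nbytes), np.int32(n))
    cl.enqueue_copy(queue, out, out_buf)
    return out
```

Local memory is typically only tens of kilobytes per work group (PyOpenCL exposes the limit as `device.local_mem_size`), so this would only apply to instruments whose attributes are a handful of scalars or a short curve, not to the multi-megabyte worst case.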
> Didn't start writing the code yet... I'm still in an exploration phase.
I don't know how complex one instrument calculation is, but it seems
to me that you should try to write kernels for a couple instrument
sub-classes so you can measure their run times. E.g. try the simplest
kernel (in both programming effort and runtime) and then a more costly
one if it still seems viable to proceed.
With a rough idea of the OpenCL run time for one instrument on all 2M
scenarios, you'll have to return to the per-job preparation cost. I
would be concerned that this cost might dominate the total run time,
if the Python interpreter is doing too much work to marshal the OpenCL
input and output buffers and manage file IO.
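One way to check whether preparation dominates is to time the two phases separately over a few representative jobs. A minimal sketch, assuming you can wrap your own code in two callables; `prepare` and `compute` are hypothetical names for the Python-side marshalling and the OpenCL run, respectively:

```python
import time


def time_phases(prepare, compute, repeats=5):
    """Return (prep_seconds, compute_seconds) summed over `repeats` jobs.

    `compute` must block until the device has finished (for example by
    calling queue.finish() or waiting on the kernel event); otherwise
    only the enqueue time is measured.
    """
    prep_total = 0.0
    compute_total = 0.0
    for _ in range(repeats):
        t0 = time.perf_counter()
        inputs = prepare()
        t1 = time.perf_counter()
        compute(inputs)
        t2 = time.perf_counter()
        prep_total += t1 - t0
        compute_total += t2 - t1
    return prep_total, compute_total


def prep_fraction(prep_total, compute_total):
    """Fraction of the total wall time spent preparing inputs."""
    total = prep_total + compute_total
    return prep_total / total if total > 0 else 0.0
```

If the fraction is close to 1 for the cheapest instrument sub-class, then batching several instruments per job is probably worth more than tuning the kernel.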
A few questions and comments in-line below...
On Oct 21, CRV§ADER//KY modulated:
> 1. circa 20,000 Python objects of class "Instrument". Each instrument
> is defined by a subclass and a dict of attributes, which will be a
> handful of scalars most of the time, or a couple of MB worth of numpy
> arrays in the worst case.
> Total size: 50MB after compressing the numpy arrays
If I understand correctly, each attribute may be a scalar or a 1D
Numpy array of variable length? Does the attribute shape vary by
individual instrument or by instrument sub-class?
> 2. 120 risk factors, each of which is a numpy 1D array of 2 million
> doubles (the risk factor values for each simulation scenario)
> Total size: 1.8GB
So, each logical kernel needs as input one row of 120 scalar risk
factors and one instrument attribute set, and it outputs one scalar?
> My calculation happens in two phases:
> Simulation: for every one of the 20,000 instruments, calculate the
> instrument value as a function of the instrument scalar settings and a
> subset of the risk factor vectors. There will be different functions
> (kernels) depending on the instrument subclass. The output is always a
> 1D array of 2 million doubles per instrument - or if you prefer, a 2D
> array of 20,000 x 2,000,000 doubles. Some instruments require as input
> the output value of other instruments, in a recursive dependency tree.
> Total output size: 300GB
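The quoted totals can be checked with a little arithmetic. A minimal sketch, assuming 8-byte doubles and binary units (1 GB read as 2**30 bytes), which is the reading under which the quoted figures come out; the constant and function names are illustrative only:

```python
BYTES_PER_DOUBLE = 8
N_SCENARIOS = 2_000_000
N_RISK_FACTORS = 120
N_INSTRUMENTS = 20_000


def size_gib(n_vectors, n_scenarios=N_SCENARIOS):
    """Size in GiB of n_vectors float64 vectors of n_scenarios each."""
    return n_vectors * n_scenarios * BYTES_PER_DOUBLE / 2**30


risk_factor_gib = size_gib(N_RISK_FACTORS)  # input risk factors
output_gib = size_gib(N_INSTRUMENTS)        # phase 1 output

print(f"risk factors:   {risk_factor_gib:.1f} GiB")  # 1.8 GiB
print(f"phase 1 output: {output_gib:.0f} GiB")       # 298 GiB
```

The risk factors could plausibly stay resident on a single device, while the phase 1 output cannot, which is why the output has to be consumed or written out in blocks.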
Have you already micro-benchmarked any mappings of this to OpenCL? It
seems to me worth checking:
A. K OpenCL jobs of shape (N,), with each worker evaluating one of K
instruments for one of N scenarios.
B. B OpenCL jobs of shape (K/B, N), with each worker evaluating one
of K instruments for one of N scenarios in one of B blocks.
Can the instrument attributes fit in local device memory? If so, this
can easily benefit (A) and may also help in (B) if you can structure
the global shape (K/B, N) into smaller (W, N) workgroups that share a
These are of course for K instruments in a single class, so they can
share the same kernel. The output should be a K x N array of scalars,
if I understand your problem statement. I'd limit the numbers K and N
for testing, before worrying about further decomposition to fit the
device and driver limits, which probably cannot cope with a 20K, much
less a 2M, job shape axis.
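To make the two mappings concrete, here is a sketch of the job shapes each one would launch. The function names are hypothetical, and the handling of a K that is not divisible by B (spreading the remainder over the first blocks) is my assumption, not something stated above:

```python
def jobs_mapping_a(k, n):
    """Mapping A: one job per instrument, each with global shape (n,)."""
    return [(n,) for _ in range(k)]


def jobs_mapping_b(k, n, b):
    """Mapping B: b jobs, each covering a block of roughly k/b instruments."""
    base, extra = divmod(k, b)
    shapes = []
    for j in range(b):
        rows = base + (1 if j < extra else 0)  # spread the remainder
        shapes.append((rows, n))
    return shapes


# Small test sizes, as suggested above, before scaling to 20K x 2M.
shapes_a = jobs_mapping_a(k=10, n=4)
shapes_b = jobs_mapping_b(k=10, n=4, b=3)
print(shapes_a[:2], len(shapes_a))
print(shapes_b)
```

Both mappings cover the same K x N work items; they differ only in how many kernel launches, and therefore how much per-job overhead, it takes to get there.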
I'd test on both GPU and CPU devices, including existing devices in
your cluster. If your cluster isn't the latest generation of CPUs
and/or GPUs, I'd also try to test on newer equipment; there could be
dramatic performance improvements that would allow a much smaller
number of new devices to meet or exceed a large pool of older ones...
> Vertical aggregation:
> I calculate the value of circa 150 nodes, each of which is a vector of
> 2 million doubles defined as a weighted sum of the value of up to 8,000
> instruments (with the weights being scalar): node_value = instr_value1
> * k1 + instr_value2 * k2 + ... + instr_valueN * kn
> Each of the 20,000 instruments can contribute to more than one of the
> output 150 nodes.
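The weighted sum above does not need all instrument values in memory at once. A minimal sketch, assuming each instrument's value vector is folded into its nodes as soon as it is produced; the names (`accumulate`, `node_weights`) are hypothetical, and plain lists stand in for NumPy arrays, where the inner loop would collapse to a single `acc += k * values`:

```python
def accumulate(node_values, instr_id, values, node_weights):
    """Add one instrument's contribution to every node that references it.

    node_values:  dict node -> list of per-scenario running totals
    values:       per-scenario values of instrument `instr_id`
    node_weights: dict node -> dict instr_id -> scalar weight k
    """
    for node, weights in node_weights.items():
        k = weights.get(instr_id)
        if k is None:
            continue  # this instrument does not contribute to this node
        acc = node_values[node]
        for s, v in enumerate(values):
            acc[s] += k * v


# Tiny example: 2 nodes, 3 instruments, 2 scenarios.
node_weights = {"A": {0: 1.0, 1: 2.0}, "B": {1: 0.5, 2: 3.0}}
instr_values = {0: [1.0, 2.0], 1: [10.0, 20.0], 2: [100.0, 200.0]}
node_values = {node: [0.0, 0.0] for node in node_weights}
for instr_id, values in instr_values.items():
    accumulate(node_values, instr_id, values, node_weights)
print(node_values)  # {'A': [21.0, 42.0], 'B': [305.0, 610.0]}
```

Instrument 1 contributes to both nodes here, matching the remark that an instrument can feed more than one node.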
This phase seems trivially parallelizable and vectorizable. You can
almost dismiss it while optimizing the phase 1 work and overall data
You've only described a K x N processing problem, where you would run
N kernels that each process one row of K values. You haven't
described any cross-communication or data shuffling if there are
multiple such sub-problems, nor approximate amount of work per input
or output data. Are your tasks truly independent? What data
management do you have to do to get your inputs into a parallel or
At one far extreme, a high throughput job manager could be used to
execute a set of independent PyOpenCL programs, each sized to fit on
your OpenCL devices, each processing a different input file containing
a subset of your N rows of data.
In the middle are a huge number of choices to balance IO, memory, and
compute resources. This leads to a huge number of different research
programs all focusing on different niches and machine models.
If you really want to abstract away the GPGPU devices, you might want
to look at OpenMP or similar projects that have tried to add such
devices to their targets for auto-vectorization. I don't work in that
space, and so have no specific recommendations.
At the other extreme, I adopted PyOpenCL to allow me to do my ad-hoc
processing in Python and Numpy with OpenCL callouts for certain
bottlenecks. I have some image processing tasks where there isn't
even enough compute time per byte of input to warrant the PCIe
transfer bottleneck in some cases. It is the same speed to run on an
i7-4700MQ mobile quad-core CPU (using just SIMD+multicore) as to run
on a desktop Kepler GPU.
For me, the data IO from disk or network would also dominate, so
distributed processing is pointless. Even so, I have used
explicit sub-block decomposition to split my large images into smaller
OpenCL tasks that can be marshaled through the system RAM or GPU to
improve locality and limit the intermediate working set sizes.
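The sub-block decomposition mentioned above can be sketched as a generator of slices over a 2D array. This is an illustration with a hypothetical name (`iter_blocks`), not the code I actually run, and it ignores any overlap or halo that a real image filter would need at block edges:

```python
def iter_blocks(shape, block):
    """Yield (row_slice, col_slice) pairs that tile a 2D array of `shape`.

    Blocks at the bottom and right edges are clipped to the array bounds.
    """
    rows, cols = shape
    brows, bcols = block
    for r0 in range(0, rows, brows):
        for c0 in range(0, cols, bcols):
            yield (slice(r0, min(r0 + brows, rows)),
                   slice(c0, min(c0 + bcols, cols)))


blocks = list(iter_blocks((5, 4), (2, 3)))
print(len(blocks))  # 6
print(blocks[0])    # (slice(0, 2, None), slice(0, 3, None))
```

Each pair can index a NumPy array directly, e.g. `image[rs, cs]`, so the same handful of OpenCL jobs can be repeated per block in a Python loop.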
"CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
> *sigh* So all that exists is an academic publication that you need to pay
> to even read? Also from the abstract I understand it targets multiple GPUs
> on a single host and introduces memory management (not sure if virtual
> VRAM) ; can't see anything related on running a problem on multiple hosts
> in parallel...
There's also this:
"CRV§ADER//KY" <crusaderky(a)gmail.com> writes:
> Hi all,
> I'm looking into setting up a cluster of GPGPU nodes. The nodes would be
> Linux based, and communicate between each other via ethernet. Each node
> would have multiple GPUs.
> I need to run a problem that for 99% can be described as y[i] = f(x1[i],
> x2[i], ... xn[i]), running on 1D vectors of data. In other words, I have n
> input vectors and 1 output vector, all of the same size, and worker i-th
> will exclusively need to access element i-th of every vector.
> Are there any frameworks, preferably in Python and with direct access to
> OpenCL, that allow to transparently split the input data in segments, send
> them over the network, do caching, feeding executor queues, etc. etc.?
> Data reuse is very heavy so if a vector is already in VRAM I don't want to
> load it twice.
> Also, are there PyOpenCL bolt-ons that allow for virtual VRAM? That is, to
> have more buffers than you can fit in VRAM, and transparently swap to
> system RAM those that are not immediately needed?
VirtCL is one. There was another, but I forgot what it was called.
I'm looking into setting up a cluster of GPGPU nodes. The nodes would be
Linux based, and communicate between each other via ethernet. Each node
would have multiple GPUs.
I need to run a problem that for 99% can be described as y[i] = f(x1[i],
x2[i], ... xn[i]), running on 1D vectors of data. In other words, I have n
input vectors and 1 output vector, all of the same size, and worker i-th
will exclusively need to access element i-th of every vector.
Are there any frameworks, preferably in Python and with direct access to
OpenCL, that allow to transparently split the input data in segments, send
them over the network, do caching, feeding executor queues, etc. etc.?
Data reuse is very heavy so if a vector is already in VRAM I don't want to
load it twice.
Also, are there PyOpenCL bolt-ons that allow for virtual VRAM? That is, to
have more buffers than you can fit in VRAM, and transparently swap to
system RAM those that are not immediately needed?
Blair Azzopardi <blairuk(a)gmail.com> writes:
> Looks like I've found a workaround. It involves setting the set_args
> parameters as object properties.
Right--set_args(something) does not hold a (Python) reference to
'something', so that object may get garbage-collected.
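The workaround can be sketched as a thin wrapper that keeps its own references to whatever was passed. `KernelCall` is a hypothetical helper, not part of PyOpenCL, and `_StandInKernel` below only imitates a kernel whose `set_args` retains nothing, so that the demonstration runs without a device:

```python
import gc
import weakref


class KernelCall:
    """Wrap a kernel and keep its arguments alive as object attributes."""

    def __init__(self, kernel):
        self.kernel = kernel
        self._args = ()

    def set_args(self, *args):
        self._args = args  # strong references live as long as this object
        self.kernel.set_args(*args)


class _StandInKernel:
    """Stand-in for a real kernel: accepts arguments, retains none."""

    def set_args(self, *args):
        pass


class _StandInBuffer:
    """Stand-in for a device buffer or host array."""


call = KernelCall(_StandInKernel())
buf = _StandInBuffer()
buf_ref = weakref.ref(buf)
call.set_args(buf)
del buf       # drop the caller's own reference
gc.collect()
still_alive = buf_ref() is not None
print(still_alive)  # True: the wrapper still holds the argument
```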
Longgang Pang <pang(a)fias.uni-frankfurt.de> writes:
> I have a problem when running the most recent PyOpenCL2015 with python2.7.6
> and 2.7.8
> The full output is as follows; it looks like 'ArgumentError' has no
> attribute 'what' for python versions 2.7.6 and 2.7.8?
> There is no problem when I run pyopencl with python/3.3.6 or python/2.7.10.
> Do you know how to fix it in the code without modifying pyopencl?
Can you submit some code that reproduces this? ArgumentError should not
be caught by this handler.