Re: [PyOpenCL] Conserve memory in OpenCl Program
by Tyler Hardin

I don't think that will work for this. I should have been a little more
precise in my short, stripped down version. The nodes are all integers and
two nodes are adjacent if their difference is equal to a number in resid.
The desired output is actually just the size of the max clique. If I
understand (not sure), what it looks like you're doing is only the first
part of the algo described below. I might be able to do that part in
Python, then the clique finding in OpenCL though, which would reduce the
size of memory needed for the CL part, so thanks!
Let's call the big graph, the one with all nodes in the array nodes[], H.
First, for gid N, I find the neighbourhood of nodes[N]. This forms a
subgraph, we'll call it HN. I don't know anything about this graph (whether
it's complete). Then, I take the resulting graph, and find it's max clique.
It really isn't an algorithmic improvement over the single threaded version
of the problem (actually, it's worse, because, say H has a max clique of C,
I'll find that same max clique at least C times. Once for each node in the
max clique.) The advantage comes from the average massive shrink in the
size of the graphs I'm working with. Instead of finding the max clique in a
graph of size |H|, I'm finding the max clique of N graphs of size less than
or equal to the largest order of any vertex in H.
On Tue, Oct 1, 2013 at 5:40 AM, CRV§ADER//KY <crusaderky(a)gmail.com> wrote:
> Let's see if I understood correctly,
>
> you want to resolve a O(n^2) problem, where
>
>
>
> out = [any(adj(input(i), input(j)) for j in G(input, i)) for i in
> range(len(input)]
>
>
>
> (in a very un-pythonic way, I’m cycling on indexes instead of values –
> just because I think it helps understand better).
>
>
>
> The problem is that I don't know what G is. N elements adjacent to i in
> the input vector? (in that case the problem is O(n)). A completely
> arbitrary, unpredictable function that extracts N elements from the whole
> input array? (O(n^2))
>
> The following is generic solution that works in the latter, worse case,
> even in the case where G(input, i) returns a vector as large as input, and
> allows to solve the problem on an arbitrarily big input (although
> performance will be hurt).
>
>
>
> You're trying to solve adj(input(i), input(j)) for all points of a square,
> each point having coordinates (i,j). You can visualize the input array as
> being both the vertical and the horizontal edge of the square. Now split
> the edge in N segments, small enough to fit in your memory; you're thus
> producing N^2 subsquares. You now have to resolve the problem on all
> subsquares in turn, loading only one or two at a time into memory.
>
>
>
> Let's say N=3. Your problem becomes
>
> inp1, inp2, inp3 = split(input, 3)
>
> out11 = [any(adj(inp1(i), inp1(j)) for j in G(inp1, i)) for i in
> range(len(inp1)]
>
> out12 = [any(adj(inp1(i), inp2(j)) for j in G(inp2, i)) for i in
> range(len(inp1)]
>
> out13 = [any(adj(inp1(i), inp3(j)) for j in G(inp3, i)) for i in
> range(len(inp1)]
>
> out21 = [any(adj(inp2(i), inp1(j)) for j in G(inp1, i)) for i in
> range(len(inp2)]
>
> out22 = [any(adj(inp2(i), inp2(j)) for j in G(inp2, i)) for i in
> range(len(inp2)]
>
> out23 = [any(adj(inp2(i), inp3(j)) for j in G(inp3, i)) for i in
> range(len(inp2)]
>
> out31 = [any(adj(inp3(i), inp1(j)) for j in G(inp1, i)) for i in
> range(len(inp3)]
>
> out32 = [any(adj(inp3(i), inp2(j)) for j in G(inp2, i)) for i in
> range(len(inp3)]
>
> out33 = [any(adj(inp3(i), inp3(j)) for j in G(inp3, i)) for i in
> range(len(inp3)]
>
> Then you have to recombine out.
>
>
>
> This MAY OR MAY NOT work for you, depending on the math properties of G().
> The math property I'm relying on is that I can apply G to a subset of
> input, recombine the output afterwards and obtain the same result. This may
> not be the case, e.g. G(input, i) = “the 4 nodes whose value is closest to
> input(i)”. You can still solve this case, but it adds an additional
> complication: for every output point, you’ll have to save the input that
> originated it as well, and recursively apply G to the “local winner” inputs
> when recombining.
>
>
>
> As you can see, this solution presents much more memory transfers than the
> monolithic one and considerably more complex to implement on the python
> side - this can't be helped.
>
> However, you'll mitigate the performance degradation if you load the next
> buffer while you calculate the current one, e.g. load everything needed to
> calculate out12 while you’re calculating out11.
>
> HTH
> On 1 Oct 2013 06:13, "Tyler Hardin" <tghardin1(a)catamount.wcu.edu> wrote:
>
>> Hi all, I'm writing an opencl kernel to find maximal cliques. The kernel
>> selects a node by the thread gid and finds the max subgraph containing that
>> node. Unfortunately, around 10000 nodes, I get an out of memory error. The
>> program is for one of my math professors and tests a conjecture of his, so
>> I can't post the code in it's entirety, but I'll give a simplified example.
>>
>> I've tried cutting down the size of the static arrays in the functions
>> that have them. Is there anything else I can do other than get a card with
>> more memory?
>>
>> Thanks for your time,
>> Tyler
>>
>> #!/usr/bin/python2.7
>>
>> import pyopencl as cl
>> import numpy
>>
>> def max_clique(nodes_list, resid_list):
>> resid_list.append(0)
>>
>> num_nodes = numpy.int32(len(nodes_list))
>> num_resid = numpy.int32(len(resid_list))
>>
>> nodes_arr = numpy.array(nodes_list, dtype='i4')
>> resid_arr = numpy.array(resid_list, dtype='i4')
>>
>> ctx = cl.create_some_context()
>> queue = cl.CommandQueue(ctx)
>>
>> mf = cl.mem_flags
>> nodes_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
>> hostbuf=nodes_arr)
>> resid_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
>> hostbuf=resid_arr)
>> dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, nodes_arr.nbytes)
>>
>> prg = cl.Program(ctx, """
>> bool adj(int node1, int node2, __global int *adj)
>> {
>> // are node1 and node2 adjacent?
>> }
>>
>> // checks if node is adj to all nodes in chain
>> bool adj_all(int* nodes, int* G, int chain_len, int node,
>> __global int* adj){
>> // is node adjacent to to all nodes in G?
>> }
>>
>> #define MAX_NODES 64
>> int chain(int num_nodes, int* nodes, __global int* connect)
>> {
>> int cur[MAX_NODES];
>>
>> // Find max clique in G
>> }
>>
>> __kernel void max_clique(__global int *nodes,
>> __global int *resid,
>> __global int *out,
>> int num_nodes,
>> int num_resid)
>> {
>> int gid = get_global_id(0);
>> int G[constant];
>> // find subgraph around nodes[gid], we call it G, store
>> // it's nodes in G
>> out[gid] = // max clique of G;
>> }
>> """).build()
>>
>> prg.max_clique(queue, nodes_arr.shape, None, nodes_buf,
>> resid_buf,
>> dest_buf,
>> num_nodes,
>> num_resid)
>> cliques = numpy.empty_like(nodes_arr)
>> cl.enqueue_copy(queue, cliques, dest_buf)
>> return max(cliques)
>>
>> print max_clique(range(0, 10000), range(1, 10))
>>
>> _______________________________________________
>> PyOpenCL mailing list
>> PyOpenCL(a)tiker.net
>> http://lists.tiker.net/listinfo/pyopencl
>>
>>
> _______________________________________________
> PyOpenCL mailing list
> PyOpenCL(a)tiker.net
> http://lists.tiker.net/listinfo/pyopencl
>
>
5 years, 9 months

Re: [PyOpenCL] Conserve memory in OpenCl Program
by CRV§ADER//KY

Let's see if I understood correctly,
you want to resolve a O(n^2) problem, where
out = [any(adj(input(i), input(j)) for j in G(input, i)) for i in
range(len(input)]
(in a very un-pythonic way, I’m cycling on indexes instead of values – just
because I think it helps understand better).
The problem is that I don't know what G is. N elements adjacent to i in the
input vector? (in that case the problem is O(n)). A completely arbitrary,
unpredictable function that extracts N elements from the whole input array?
(O(n^2))
The following is generic solution that works in the latter, worse case,
even in the case where G(input, i) returns a vector as large as input, and
allows to solve the problem on an arbitrarily big input (although
performance will be hurt).
You're trying to solve adj(input(i), input(j)) for all points of a square,
each point having coordinates (i,j). You can visualize the input array as
being both the vertical and the horizontal edge of the square. Now split
the edge in N segments, small enough to fit in your memory; you're thus
producing N^2 subsquares. You now have to resolve the problem on all
subsquares in turn, loading only one or two at a time into memory.
Let's say N=3. Your problem becomes
inp1, inp2, inp3 = split(input, 3)
out11 = [any(adj(inp1(i), inp1(j)) for j in G(inp1, i)) for i in
range(len(inp1)]
out12 = [any(adj(inp1(i), inp2(j)) for j in G(inp2, i)) for i in
range(len(inp1)]
out13 = [any(adj(inp1(i), inp3(j)) for j in G(inp3, i)) for i in
range(len(inp1)]
out21 = [any(adj(inp2(i), inp1(j)) for j in G(inp1, i)) for i in
range(len(inp2)]
out22 = [any(adj(inp2(i), inp2(j)) for j in G(inp2, i)) for i in
range(len(inp2)]
out23 = [any(adj(inp2(i), inp3(j)) for j in G(inp3, i)) for i in
range(len(inp2)]
out31 = [any(adj(inp3(i), inp1(j)) for j in G(inp1, i)) for i in
range(len(inp3)]
out32 = [any(adj(inp3(i), inp2(j)) for j in G(inp2, i)) for i in
range(len(inp3)]
out33 = [any(adj(inp3(i), inp3(j)) for j in G(inp3, i)) for i in
range(len(inp3)]
Then you have to recombine out.
This MAY OR MAY NOT work for you, depending on the math properties of G().
The math property I'm relying on is that I can apply G to a subset of
input, recombine the output afterwards and obtain the same result. This may
not be the case, e.g. G(input, i) = “the 4 nodes whose value is closest to
input(i)”. You can still solve this case, but it adds an additional
complication: for every output point, you’ll have to save the input that
originated it as well, and recursively apply G to the “local winner” inputs
when recombining.
As you can see, this solution presents much more memory transfers than the
monolithic one and considerably more complex to implement on the python
side - this can't be helped.
However, you'll mitigate the performance degradation if you load the next
buffer while you calculate the current one, e.g. load everything needed to
calculate out12 while you’re calculating out11.
HTH
On 1 Oct 2013 06:13, "Tyler Hardin" <tghardin1(a)catamount.wcu.edu> wrote:
> Hi all, I'm writing an opencl kernel to find maximal cliques. The kernel
> selects a node by the thread gid and finds the max subgraph containing that
> node. Unfortunately, around 10000 nodes, I get an out of memory error. The
> program is for one of my math professors and tests a conjecture of his, so
> I can't post the code in it's entirety, but I'll give a simplified example.
>
> I've tried cutting down the size of the static arrays in the functions
> that have them. Is there anything else I can do other than get a card with
> more memory?
>
> Thanks for your time,
> Tyler
>
> #!/usr/bin/python2.7
>
> import pyopencl as cl
> import numpy
>
> def max_clique(nodes_list, resid_list):
> resid_list.append(0)
>
> num_nodes = numpy.int32(len(nodes_list))
> num_resid = numpy.int32(len(resid_list))
>
> nodes_arr = numpy.array(nodes_list, dtype='i4')
> resid_arr = numpy.array(resid_list, dtype='i4')
>
> ctx = cl.create_some_context()
> queue = cl.CommandQueue(ctx)
>
> mf = cl.mem_flags
> nodes_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
> hostbuf=nodes_arr)
> resid_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
> hostbuf=resid_arr)
> dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, nodes_arr.nbytes)
>
> prg = cl.Program(ctx, """
> bool adj(int node1, int node2, __global int *adj)
> {
> // are node1 and node2 adjacent?
> }
>
> // checks if node is adj to all nodes in chain
> bool adj_all(int* nodes, int* G, int chain_len, int node, __global
> int* adj){
> // is node adjacent to to all nodes in G?
> }
>
> #define MAX_NODES 64
> int chain(int num_nodes, int* nodes, __global int* connect)
> {
> int cur[MAX_NODES];
>
> // Find max clique in G
> }
>
> __kernel void max_clique(__global int *nodes,
> __global int *resid,
> __global int *out,
> int num_nodes,
> int num_resid)
> {
> int gid = get_global_id(0);
> int G[constant];
> // find subgraph around nodes[gid], we call it G, store
> // it's nodes in G
> out[gid] = // max clique of G;
> }
> """).build()
>
> prg.max_clique(queue, nodes_arr.shape, None, nodes_buf,
> resid_buf,
> dest_buf,
> num_nodes,
> num_resid)
> cliques = numpy.empty_like(nodes_arr)
> cl.enqueue_copy(queue, cliques, dest_buf)
> return max(cliques)
>
> print max_clique(range(0, 10000), range(1, 10))
>
> _______________________________________________
> PyOpenCL mailing list
> PyOpenCL(a)tiker.net
> http://lists.tiker.net/listinfo/pyopencl
>
>
5 years, 9 months

Re: [PyOpenCL] Conserve memory in OpenCl Program
by Andreas Kloeckner

Tyler Hardin <tghardin1(a)catamount.wcu.edu> writes:
> The stack trace says the error occurs in the enqueue_copy right before the
> return of the max_clique (the Python function). Since the input buffers are
> no longer needed at this point, if I could deallocate those, it seems that
> would increase the size of graph I can work with without an error.
> (Because, that way, I would have a maximum of three arrays in memory at a
> time, instead of four.)
Please make sure to keep the list cc'd.
What you suggest won't work, because OpenCL executes asynchronously. You
get errors (including out-of-memory) at pretty much arbitrary times
*after* allocation (specifically, not *at* allocation time).
HTH,
Andreas
5 years, 9 months

Re: [PyOpenCL] Conserve memory in OpenCl Program
by Andreas Kloeckner

Hi Tyler,
Tyler Hardin <tghardin1(a)catamount.wcu.edu> writes:
> Hi all, I'm writing an opencl kernel to find maximal cliques. The kernel
> selects a node by the thread gid and finds the max subgraph containing that
> node. Unfortunately, around 10000 nodes, I get an out of memory error. The
> program is for one of my math professors and tests a conjecture of his, so
> I can't post the code in it's entirety, but I'll give a simplified example.
>
> I've tried cutting down the size of the static arrays in the functions that
> have them. Is there anything else I can do other than get a card with more
> memory?
where is your out-of-memory error occurring? From your description, it
sounds like it is happening when you call your kernel. Note that OpenCL
implementations may (and do) defer memory allocation until the first use
of the memory, so your kernel likely has little to do with this. (You
can see this by zero-filling the arrays, or even small parts of them.)
So, while this is not what what you wanted to hear, if the buffers you
allocate do not fit onto the GPU that you have, you may need to buy a
bigger card.
If you're using an AMD GPU, note that there are some environment
variables that control how much graphics memory is available (see the
AMD developer guide).
HTH,
Andreas
5 years, 9 months

Conserve Memory in OpenCl Program
by Tyler Hardin

(Sorry if you're getting this twice. I sent it from my secondary email
instead of the one I registered with by accident.)
Hi all, I'm writing an opencl kernel to find maximal cliques. The kernel
selects a node by the thread gid and finds the max subgraph containing that
node. Unfortunately, around 10000 nodes, I get an out of memory error. The
program is for one of my math professors and tests a conjecture of his, so
I can't post the code in it's entirety, but I'll give a simplified example.
I've tried cutting down the size of the static arrays in the functions that
have them. Is there anything else I can do other than get a card with more
memory? Also, the error is occurring at the enqueue_copy line, which seems
to suggest all is well until I try to retrieve the results. Is there any
way I deallocate the nodes_buf and resid_buf buffers to free memory to make
sure the copy can work?
Thanks for your time,
Tyler
#!/usr/bin/python2.7
import pyopencl as cl
import numpy
def max_clique(nodes_list, resid_list):
resid_list.append(0)
num_nodes = numpy.int32(len(nodes_list))
num_resid = numpy.int32(len(resid_list))
nodes_arr = numpy.array(nodes_list, dtype='i4')
resid_arr = numpy.array(resid_list, dtype='i4')
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
nodes_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=nodes_arr)
resid_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=resid_arr)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, nodes_arr.nbytes)
prg = cl.Program(ctx, """
bool adj(int node1, int node2, __global int *adj)
{
// are node1 and node2 adjacent?
}
// checks if node is adj to all nodes in chain
bool adj_all(int* nodes, int* G, int chain_len, int node, __global
int* adj){
// is node adjacent to to all nodes in G?
}
#define MAX_NODES 64
int chain(int num_nodes, int* nodes, __global int* connect)
{
int cur[MAX_NODES];
// Find max clique in G
}
__kernel void max_clique(__global int *nodes,
__global int *resid,
__global int *out,
int num_nodes,
int num_resid)
{
int gid = get_global_id(0);
int G[constant];
// find subgraph around nodes[gid], we call it G, store
// it's nodes in G
out[gid] = // max clique of G;
}
""").build()
prg.max_clique(queue, nodes_arr.shape, None, nodes_buf,
resid_buf,
dest_buf,
num_nodes,
num_resid)
cliques = numpy.empty_like(nodes_arr)
cl.enqueue_copy(queue, cliques, dest_buf)
return max(cliques)
print max_clique(range(0, 10000), range(1, 10))
5 years, 9 months

Conserve memory in OpenCl Program
by Tyler Hardin

Hi all, I'm writing an opencl kernel to find maximal cliques. The kernel
selects a node by the thread gid and finds the max subgraph containing that
node. Unfortunately, around 10000 nodes, I get an out of memory error. The
program is for one of my math professors and tests a conjecture of his, so
I can't post the code in it's entirety, but I'll give a simplified example.
I've tried cutting down the size of the static arrays in the functions that
have them. Is there anything else I can do other than get a card with more
memory?
Thanks for your time,
Tyler
#!/usr/bin/python2.7
import pyopencl as cl
import numpy
def max_clique(nodes_list, resid_list):
resid_list.append(0)
num_nodes = numpy.int32(len(nodes_list))
num_resid = numpy.int32(len(resid_list))
nodes_arr = numpy.array(nodes_list, dtype='i4')
resid_arr = numpy.array(resid_list, dtype='i4')
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
nodes_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=nodes_arr)
resid_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR,
hostbuf=resid_arr)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, nodes_arr.nbytes)
prg = cl.Program(ctx, """
bool adj(int node1, int node2, __global int *adj)
{
// are node1 and node2 adjacent?
}
// checks if node is adj to all nodes in chain
bool adj_all(int* nodes, int* G, int chain_len, int node, __global
int* adj){
// is node adjacent to to all nodes in G?
}
#define MAX_NODES 64
int chain(int num_nodes, int* nodes, __global int* connect)
{
int cur[MAX_NODES];
// Find max clique in G
}
__kernel void max_clique(__global int *nodes,
__global int *resid,
__global int *out,
int num_nodes,
int num_resid)
{
int gid = get_global_id(0);
int G[constant];
// find subgraph around nodes[gid], we call it G, store
// it's nodes in G
out[gid] = // max clique of G;
}
""").build()
prg.max_clique(queue, nodes_arr.shape, None, nodes_buf,
resid_buf,
dest_buf,
num_nodes,
num_resid)
cliques = numpy.empty_like(nodes_arr)
cl.enqueue_copy(queue, cliques, dest_buf)
return max(cliques)
print max_clique(range(0, 10000), range(1, 10))
5 years, 9 months