Received from Keith Brown on Wed, Nov 25, 2015 at 03:47:32PM EST:
> Hi Lev,
>
> Here is the context.
>
> np.dot(a.T,a)
> Here a.T is a view and doesn't take up a lot of memory.
> If I were to do
> np.dot(a.T.copy(), a), it would take up more memory.
>
> This is part of my memory saving quest on the GPU. Trying to find some
> ways around it...
If you tell skcuda.linalg.dot() to treat its first argument as transposed, you
don't need to copy the matrix:
skcuda.linalg.dot(a, a, 'T', 'N')
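For reference, the view-vs-copy distinction Keith describes can be checked directly on the CPU side with plain NumPy (this is a sketch of the memory behavior, not skcuda code):

```python
import numpy as np

a = np.ones((1000, 3), dtype=np.float32)

# a.T is a view: it shares the same buffer as a, so no extra allocation.
view = a.T
assert view.base is a

# a.T.copy() materializes a new (3, 1000) array: a separate allocation.
copied = a.T.copy()
assert copied.base is None

# Both produce the same product, of course.
assert np.allclose(np.dot(view, a), np.dot(copied, a))
```

Passing 'T' as the transa argument lets the GEMM routine read the untransposed buffer directly, which is why no copy is needed.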
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/

Received from Keith Brown on Wed, Nov 25, 2015 at 02:07:02PM EST:
> Is it possible to have pycuda return a view instead of a real array?
To what purpose? What do you mean by a view in this case?
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/

So it turns out this works on the CPU because a.T is a view and
doesn't occupy much (if any) extra memory. Now, for PyCUDA I need to do
a.T.copy() to get it to work, but that takes up more memory, which is
leading to a memory allocation error.
Does anyone have an example of a dot product with streams?
On Mon, Nov 23, 2015 at 3:14 PM, Keith Brown <keith6014(a)gmail.com> wrote:
> Thanks all for the replies.
>
> My goal is simple. At least, I thought it was simple :-)
>
> I have a function where I calculate the dot product:
>
> def F(a, b):
>     return np.dot(a.T, b)
>
> I need to do this 8k times. The max size of 'a' and 'b' is (3 million, 1).
>
> For smaller sizes of a and b, linalg.dot works great. But I want a
> more efficient way using the GPU.
>
> Perhaps the GPU isn't the way to go since the memory is too large?
>
> On Mon, Nov 23, 2015 at 2:26 PM, Stanley Seibert <stan(a)mtrr.org> wrote:
>> From the cuBLAS-XT description:
>>
>> (https://developer.nvidia.com/cublas)
>>
>> "By using a streaming design, cuBLAS-XT efficiently manages transfers across the PCI-Express bus automatically, which allows input and output data to be stored on the host’s system memory. This provides out-of-core operation – the size of operand data is only limited by system memory size, not by GPU on-board memory size.”
>>
>> So I don’t think cuBLAS-XT can help unless you have more than 95 GB of system RAM. If that is not the case, I think you have to step back and think about what you need to do with this array ultimately, and where you want to stage the data if you need to compute all 95 GB of it at once.
>>
>>
>>> On Nov 23, 2015, at 12:58 PM, Keith Brown <keith6014(a)gmail.com> wrote:
>>>
>>> Correct. My result matrix will be too large.
>>>
>>> <sigh>
>>>
>>> I would think cublasXT would take care of this for me. I thought it
>>> would do some sort of divide and conquer.
>>>
>>> Is there a way to attack this sort of problem?
>>>
>>> On Mon, Nov 23, 2015 at 11:38 AM, Jonas Bardino <bardino(a)nbi.ku.dk> wrote:
>>>> Ehmm, I'm not sure I understand exactly what you do, but to me it sounds
>>>> like you try to calculate the dot product of a 160080 x 3 matrix and a
>>>> similar one transposed, i.e. a 3 x 160080 matrix. That would give you a
>>>> 160080 x 160080 matrix result - which surely won't fit your 3GB of GPU
>>>> memory.
>>>>
>>>> Cheers, Jonas
>>>>
>>>> On 2015-11-23 17:10, Keith Brown wrote:
>>>>> I have 2 small matrices (160080, 3) of type float32 and I am
>>>>> calculating their dot product. While doing this, I keep getting
>>>>> pycuda._driver.MemoryError: cuMemAlloc failed: out of memory.
>>>>>
>>>>> I have 2 cards, each with 3GB of memory. Each matrix takes about 1875
>>>>> kilobytes. I am not sure why this is occurring.
>>>>>
>>>>> x = np.ones((160080, 3)).astype(np.float32)
>>>>> a_gpu = gpuarray.to_gpu(x)
>>>>> b_gpu = gpuarray.to_gpu(x)
>>>>> c_gpu = linalg.dot(a_gpu, b_gpu, 'N', 'T', handle=handle)
>>>>>
>>>>> My handle is a cublasxt handle (not regular cublas, since blasxt apparently
>>>>> does better memory handling).
>>>>>
>>>>> Any idea what is going on?
>>>>>
>>>>> _______________________________________________
>>>>> PyCUDA mailing list
>>>>> PyCUDA(a)tiker.net
>>>>> http://lists.tiker.net/listinfo/pycuda

Does anyone have any thoughts? Is this feasible?

Thanks all for the replies.
My goal is simple. At least, I thought it was simple :-)
I have a function where I calculate the dot product:
def F(a, b):
    return np.dot(a.T, b)
I need to do this 8k times. The max size of 'a' and 'b' is (3 million, 1).
For smaller sizes of a and b, linalg.dot works great. But I want a
more efficient way using the GPU.
Perhaps the GPU isn't the way to go since the memory is too large?
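Worth noting: if a and b really are (3 million, 1) arrays, np.dot(a.T, b) is just an inner product and its result is a single number, so the output side is tiny; the memory blow-up only occurs with the operands the other way around. A quick CPU-only NumPy check:

```python
import numpy as np

n = 3_000_000
a = np.ones((n, 1), dtype=np.float32)
b = np.ones((n, 1), dtype=np.float32)

# (1, n) @ (n, 1) -> a (1, 1) result: effectively an inner product,
# so the output itself costs almost nothing to store.
c = np.dot(a.T, b)
assert c.shape == (1, 1)
assert float(c[0, 0]) == float(n)

# Swapping the transpose, np.dot(a, b.T), would instead produce an
# (n, n) matrix -- that is the combination that blows up memory.
```

So the 8k repetitions of this shape should be cheap per call; the expensive case in this thread is the (160080, 3) x (3, 160080) product discussed below.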

Ehmm, I'm not sure I understand exactly what you're doing, but to me it sounds
like you're trying to calculate the dot product of a 160080 x 3 matrix and a
similar one transposed, i.e. a 3 x 160080 matrix. That would give you a
160080 x 160080 matrix result - which surely won't fit in your 3GB of GPU
memory.
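The shape argument above can be sketched at small scale (toy sizes standing in for the real ones):

```python
import numpy as np

# Small stand-in for the (160080, 3) matrix: same column count, fewer rows.
a = np.ones((5, 3), dtype=np.float32)

# dot(a, a.T): (5, 3) @ (3, 5) -> (5, 5). With 160080 rows this becomes
# the 160080 x 160080 result that cannot fit in 3 GB of GPU memory.
assert np.dot(a, a.T).shape == (5, 5)

# dot(a.T, a): (3, 5) @ (5, 3) -> (3, 3). This stays small no matter
# how many rows a has.
assert np.dot(a.T, a).shape == (3, 3)
```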
Cheers, Jonas

>>> c_gpu = linalg.dot(a_gpu,b_gpu,'N','T',handle=handle)
Isn't your output matrix of size 160080x160080?
Yiyin
On Mon, Nov 23, 2015 at 11:43 AM, Keith Brown <keith6014(a)gmail.com> wrote:
> I modified add_dot() to use cublasxt.cublasXtSgemm. I don't think I
> need to modify dot(), because it calls add_dot() at the end. It doesn't
> call cublasxt.cublasXtSgemm directly unless my matrix is 1-d (which
> it isn't), correct?
>
> BTW, smaller matrices work fine; it's just the larger matrices that fail.

Received from Keith Brown on Mon, Nov 23, 2015 at 11:10:45AM EST:
> I have 2 small matrices (160080, 3) of type float32 and I am
> calculating their dot product. While doing this, I keep getting
> pycuda._driver.MemoryError: cuMemAlloc failed: out of memory.
>
> I have 2 cards, each with 3GB of memory. Each matrix takes about 1875
> kilobytes. I am not sure why this is occurring.
>
> x = np.ones((160080, 3)).astype(np.float32)
> a_gpu = gpuarray.to_gpu(x)
> b_gpu = gpuarray.to_gpu(x)
> c_gpu = linalg.dot(a_gpu, b_gpu, 'N', 'T', handle=handle)
>
> My handle is a cublasxt handle (not regular cublas, since blasxt apparently
> does better memory handling).
>
> Any idea what is going on?
Did you also modify skcuda.linalg.dot() to explicitly call the cublasXt*gemm
functions rather than the stock cublas*gemm functions? The cublasXt*gemm
functions expect host memory pointers as their arguments, not GPU memory
pointers.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/

You are computing the product of a [160080, 3] and a [3, 160080] matrix,
so the result is a [160080, 160080] matrix. To store a matrix of that
size (as float32) you would need 95GB of RAM. That's a tough fit for a
3GB GPU ;-)
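The arithmetic behind the 95 GB figure, for the record:

```python
# Result of (160080, 3) @ (3, 160080): a 160080 x 160080 float32 matrix,
# at 4 bytes per element.
n = 160080
result_gib = n * n * 4 / 2**30
assert 95 < result_gib < 96  # roughly 95.5 GiB

# The inputs, by contrast, are tiny: 160080 * 3 float32 values each,
# which is the "about 1875 kilobytes" mentioned earlier in the thread.
input_kib = n * 3 * 4 / 1024
assert 1800 < input_kib < 1900
```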