Hello.
I've been packaging PyCUDA for Debian.
I run all the tests to ensure that the package works on Python 2
and Python 3. All tests pass except for one from test_driver.py:
$ python test_driver.py
============================= test session starts ==============================
platform linux2 -- Python 2.7.5 -- pytest-2.3.5
collected 21 items
test_driver.py ........F............
=================================== FAILURES ===================================
_____________________ TestDriver.test_register_host_memory _____________________
args = (<test_driver.TestDriver instance at 0x24e7d88>,), kwargs = {}
pycuda = <module 'pycuda' from '/usr/lib/python2.7/dist-packages/pycuda/__init__.pyc'>
ctx = <pycuda._driver.Context object at 0x2504488>
clear_context_caches = <function clear_context_caches at 0x1dbf848>
collect = <built-in function collect>
    def f(*args, **kwargs):
        import pycuda.driver
        # appears to be idempotent, i.e. no harm in calling it more than once
        pycuda.driver.init()
        ctx = make_default_context()
        try:
            assert isinstance(ctx.get_device().name(), str)
            assert isinstance(ctx.get_device().compute_capability(), tuple)
            assert isinstance(ctx.get_device().get_attributes(), dict)
>           inner_f(*args, **kwargs)
/usr/lib/python2.7/dist-packages/pycuda/tools.py:434:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_driver.TestDriver instance at 0x24e7d88>
    @mark_cuda_test
    def test_register_host_memory(self):
        if drv.get_version() < (4,):
            from py.test import skip
            skip("register_host_memory only exists on CUDA 4.0 and later")
        import sys
        if sys.platform == "darwin":
            from py.test import skip
            skip("register_host_memory is not supported on OS X")
        a = drv.aligned_empty((2**20,), np.float64, alignment=4096)
>       drv.register_host_memory(a)
E       LogicError: cuMemHostRegister failed: invalid value
test_driver.py:559: LogicError
==================== 1 failed, 20 passed in 116.85 seconds =====================
This test fails on both ION (GeForce 9400M, CC 1.1) and GeForce 460
(CC 2.1). I compiled PyCUDA with gcc 4.8 and ran it with kernel 3.9
and driver 304.88.
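One thing that may be worth ruling out (an assumption on my side, not a diagnosis): as far as I know, the CUDA 4.0 documentation for cuMemHostRegister asks for both the start address and the byte size of the buffer to be multiples of the host page size. The sketch below uses NumPy only; aligned_empty_sketch and is_page_aligned are hypothetical helpers written for this illustration, not PyCUDA functions, and the 4096-byte page size is assumed.

```python
import numpy as np

PAGE_SIZE = 4096  # assumed host page size in bytes


def aligned_empty_sketch(shape, dtype, alignment=PAGE_SIZE):
    """Hypothetical stand-in for an aligned allocator (NumPy only)."""
    dtype = np.dtype(dtype)
    nbytes = int(np.prod(shape)) * dtype.itemsize
    # Over-allocate, then slice so the first byte sits on an aligned address.
    raw = np.empty(nbytes + alignment, dtype=np.uint8)
    start = (-raw.ctypes.data) % alignment
    return raw[start:start + nbytes].view(dtype).reshape(shape)


def is_page_aligned(array, page_size=PAGE_SIZE):
    """True if both the address and the byte size are page multiples."""
    return array.ctypes.data % page_size == 0 and array.nbytes % page_size == 0


a = aligned_empty_sketch((2**20,), np.float64)
print(is_page_aligned(a))
```

Running is_page_aligned on the array returned by drv.aligned_empty in the failing test would show whether the buffer handed to the driver really starts on a page boundary.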
Regards.
--
Tomasz Rybak GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak
Hello,
I've got a host-side CUDA library wrapped in Cython, and I'd like to use
it on a device-side array that I've allocated in Python with PyCUDA.
However, I'm completely at a loss as to how I should pass the device
pointers from the Python side of things to the C library. In CUDA the
device pointers look like normal pointers, but in PyCUDA I get a
DeviceAllocation object -- how can I pass the pointer address
information to the C library through Cython? What types should I use in
the Cython wrapper?
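A minimal sketch of the usual idea, under stated assumptions: a PyCUDA DeviceAllocation can be converted with int() to get the raw address, and a wrapper can accept that integer and cast it back to a typed pointer. To keep the example runnable without a GPU, a host buffer from ctypes stands in for the device memory, and wrapped_library_call is a hypothetical name. In Cython the argument would typically be declared as size_t (or uintptr_t) and cast with <float*>.

```python
import ctypes

# Stand-in for a device buffer: a host array of four floats.
# With PyCUDA the integer would instead come from int(device_allocation).
host_buffer = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
address = ctypes.addressof(host_buffer)  # a plain Python int


def wrapped_library_call(ptr_as_int, n):
    """Hypothetical wrapper that receives the address as an integer.

    A Cython version might read:
        def call(size_t ptr_as_int, int n):
            c_library_function(<float*> ptr_as_int, n)
    """
    typed_ptr = ctypes.cast(ctypes.c_void_p(ptr_as_int),
                            ctypes.POINTER(ctypes.c_float))
    # Reading through the pointer is only valid here because the stand-in
    # lives in host memory; a real device pointer must not be dereferenced
    # on the host, only handed on to the CUDA library.
    return [typed_ptr[i] for i in range(n)]


values = wrapped_library_call(address, 4)
print(values)
```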
thanks!
Rok
Hi Alex,
Alex Nitz <alex.nitz(a)ligo.org> writes:
> I've noticed that a change made several months ago to the string error
> handling isn't compatible with versions of python earlier than 2.7.
>
> The following fails on python versions < 2.7.
>
> s = "test string"
> s.decode("UTF8", errors='replace')
>
> as keywords were not supported at the time. I've attached a simple patch
> that makes these positional arguments.
>
> as in
>
> s.decode("UTF8", "replace")
Applied. Thanks for the patch.
Andreas
Hello,
I've noticed that a change made several months ago to the string error
handling isn't compatible with versions of python earlier than 2.7.
The following fails on python versions < 2.7.
s = "test string"
s.decode("UTF8", errors='replace')
as keywords were not supported at the time. I've attached a simple patch
that makes these positional arguments.
as in
s.decode("UTF8", "replace")
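A small sketch of the positional form, written with a bytes literal so that it also runs on current Python 3 interpreters (an assumption of this example; in Python 2 the same call works on a plain str). The second call shows what the "replace" handler does with a byte that is not valid UTF-8.

```python
# Positional arguments: encoding first, error handler second.
raw = b"test string"
decoded = raw.decode("UTF8", "replace")

# An invalid byte is replaced by U+FFFD instead of raising an exception.
damaged = b"caf\xff".decode("UTF8", "replace")

print(decoded)
print(repr(damaged))
```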
Thanks,
Alex Nitz
Hi there,
I am using PyCUDA to render slices out of a 3D texture, which are then passed to an OpenGL PBO for display on the screen. Everything is going well except that I cannot get my texture to use the address modes WRAP or MIRROR correctly: they produce exactly the same effect as CLAMP. I am confident that I am making a valid function call because when I use the BORDER mode I get the expected return value of 0 for all lookups outside the volume, but I can't get WRAP or MIRROR working.
The calls simply look like this:
texture.set_address_mode(0, pycuda.driver.address_mode.WRAP)
I can fake the WRAP behaviour by using "x - floorf(x)" as the lookup coordinate, but this is less convenient because it is hardcoded into the kernel and not easily changed from the Python side, and it doesn't let me use MIRROR either.
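For reference, a sketch of what the two modes compute on a normalized coordinate, following the description in the CUDA C Programming Guide: WRAP uses the fractional part of x, and MIRROR uses the fractional part when floor(x) is even and one minus it when floor(x) is odd. The guide also states that both modes are only supported with normalized coordinates, so it may be worth checking whether the texture has the pycuda.driver.TRSF_NORMALIZED_COORDINATES flag set via set_flags; whether that is the cause here is an assumption. The function names below are made up for the illustration.

```python
import math


def wrap_coord(x):
    """WRAP: keep only the fractional part, x - floor(x)."""
    return x - math.floor(x)


def mirror_coord(x):
    """MIRROR: fractional part on even periods, reflected on odd ones."""
    whole = math.floor(x)
    frac = x - whole
    return frac if whole % 2 == 0 else 1.0 - frac


for x in (0.25, 1.25, -0.25):
    print(x, wrap_coord(x), mirror_coord(x))
```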
Thanks in advance for any help you can give,
Ben
Hello Bogdan
Thank you very much for some interesting ideas.
The fact that you can run 8192 x 8192 on your C2050 clearly suggests that the limit I hit comes from my Quadro 2000.
I had a look at Reikna and it is indeed helpful.
And Ahmed,
I realised that taking a 2D array and performing two separate sequential 1D FFTs on it, one horizontal and the other vertical, does not yield the same result. Clearly 1D FFT and 2D FFT are different.
They have done the same in http://wiki.tiker.net/PyCuda/Examples/2DFFT . It is not a 2D FFT but a 1D FFT for each row, followed by reshaping back to 2D. The result is not a 2D FFT.
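For comparison, here is a small NumPy check of the row-then-column idea (NumPy on the CPU is an assumption of this sketch; it does not use pyfft). With NumPy, a 1D FFT along every row followed by a 1D FFT along every column agrees with fft2 to rounding error, while transforming the rows alone does not. A mismatch may therefore mean that only one axis was transformed.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((8, 16)) + 1j * rng.standard_normal((8, 16))

# 1D FFT along each row (axis=1), then along each column (axis=0).
rows_then_cols = np.fft.fft(np.fft.fft(data, axis=1), axis=0)

# Direct 2D FFT for comparison.
full_2d = np.fft.fft2(data)

# Rows only: this is a batch of 1D FFTs, not a 2D FFT.
rows_only = np.fft.fft(data, axis=1)

print(np.allclose(rows_then_cols, full_2d))
print(np.allclose(rows_only, full_2d))
```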
For my problem, I need to compute a 3D FFT of an array of about 1024 x 4096 x 4096 elements using parallel computing with PyCUDA.
Is it necessary to write a kernel in C for this, or can I proceed the way I described in my previous mail? With my program, I can readily see a 10x speedup compared to the numpy FFT, but my GPU is unable to handle such a large amount of data.
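A rough size estimate may help here. The sketch assumes single-precision complex data (complex64, 8 bytes per element) and counts one buffer only, without any temporary buffers the FFT library allocates; buffer_bytes is a name made up for this example.

```python
COMPLEX64_BYTES = 8  # two 32-bit floats per element


def buffer_bytes(shape, itemsize=COMPLEX64_BYTES):
    """Bytes needed for one dense array of the given shape."""
    count = 1
    for extent in shape:
        count *= extent
    return count * itemsize


mib_4096_2d = buffer_bytes((4096, 4096)) / 2**20
mib_8192x4096 = buffer_bytes((8192, 4096)) / 2**20
gib_3d = buffer_bytes((1024, 4096, 4096)) / 2**30

print(mib_4096_2d, "MiB for 4096 x 4096")
print(mib_8192x4096, "MiB for 8192 x 4096")
print(gib_3d, "GiB for 1024 x 4096 x 4096")
```

Under these assumptions the 3D array alone is far larger than the memory of any single card mentioned in this thread, so it would have to be processed in pieces.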
It would be really helpful if anyone could suggest documentation, blogs, videos, etc. on this.
Thank you all.
Have a good day
Jayanth
> Date: Fri, 6 Dec 2013 16:47:17 +1100
> Subject: Re: [PyCUDA] cuMemAlloc failed: out of memory
> From: mantihor(a)gmail.com
> To: cv.jayanth(a)hotmail.com
> CC: wuzzyview(a)gmail.com; pycuda(a)tiker.net
>
> Hi Jayanth,
>
> I can run a 8192x8192 transform on a Tesla C2050 without problems. I
> think you are limited by the available video memory, see my previous
> message in this thread --- a 8192x4096 buffer takes 250Mb, and you
> have to factor in the temporary buffers PyFFT creates.
>
> By the way, I would recommend you to switch from PyFFT to Reikna
> (http://reikna.publicfields.net). PyFFT is not supported anymore, and
> Reikna includes its code along with some additional features and
> optimizations (more robust block/grid size finder, temporary array
> management, launch optimizations and so on). Your code would look
> like:
>
> import numpy
> import reikna.cluda as cluda
> from reikna.fft import FFT
>
> api = cluda.cuda_api()
> thr = api.Thread.create()
>
> # Or, if you want to use an external stream,
> #
> # cuda.init()
> # context = make_default_context()
> # stream = cuda.Stream()
> # thr = api.Thread(stream)
>
> data = numpy.ones((4096, 4096), dtype = numpy.complex64)
> gpu_data = thr.to_device(data) #converting to gpu array
>
> fft = FFT(data).compile(thr)
> fft(gpu_data, gpu_data)
> result = gpu_data.get()
>
> print result
>
>
> On Fri, Dec 6, 2013 at 3:43 PM, Jayanth Channagiri
> <cv.jayanth(a)hotmail.com> wrote:
On Fri, 6 Dec 2013 16:47:17 +1100
Bogdan Opanchuk <mantihor(a)gmail.com> wrote:
> I can run a 8192x8192 transform on a Tesla C2050 without problems. I
> think you are limited by the available video memory, see my previous
> message in this thread --- a 8192x4096 buffer takes 250Mb, and you
> have to factor in the temporary buffers PyFFT creates.
I can confirm having run a 3D FFT of size 450x450x450 on a GeForce Titan (6 GB)
using scikits.cuda.
Cheers,
--
Jérôme Kieffer
Data analysis unit - ESRF
Hi oyster,
I have fixed two things in order to make your program runnable:
- replaced 'numPoint.x' and 'numPoint.y' with 'numPointX' and 'numPointY',
- added 'startTime = time.time()' line before the kernel call
There are the following problems with the code:
- The shape of 'iter' is incorrect: you are addressing it as if it was
(numPointY, numPointX), but it has the shape (numPointX, numPointY).
It is not the cause of the launch failure, but will probably give you
incorrect results.
- When you calculate 'offset' in the kernel, you are multiplying
'yIdx' by the total width of the grid ('blockDim.x*gridDim.x'), but
your array is contiguous, so you need to multiply by its actual
dimension ('numPointx') instead.
After fixing these, your program runs without errors.
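The offset problem can be sketched on the host side in plain Python. The sizes below are illustrative values close to those in the original post (5001 x 5501 points, 16 x 16 blocks), and the function names are made up for this example. With the padded grid width as the row stride, the last valid pixel lands beyond the end of the buffer, which would be consistent with the launch failure; with the real row length it lands exactly on the last element.

```python
num_point_x, num_point_y = 5001, 5501  # illustrative sizes
block = 16

# Grid is rounded up to whole blocks, so it is wider than the array.
grid_x = (num_point_x + block - 1) // block
padded_width = grid_x * block  # corresponds to blockDim.x * gridDim.x

buffer_size = num_point_x * num_point_y  # elements in the packed array


def offset_padded(x_idx, y_idx):
    """Row stride taken from the launch grid (the original kernel)."""
    return x_idx + y_idx * padded_width


def offset_packed(x_idx, y_idx):
    """Row stride taken from the actual array width (the fix)."""
    return x_idx + y_idx * num_point_x


last_x, last_y = num_point_x - 1, num_point_y - 1
print(offset_padded(last_x, last_y), buffer_size)
print(offset_packed(last_x, last_y), buffer_size)
```

On the Python side the matching array shape would be (numPointY, numPointX), so that each row holds numPointX contiguous elements.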
Best regards,
Bogdan
On Sat, Dec 7, 2013 at 1:58 AM, oyster <lepto.python(a)gmail.com> wrote:
Hi there. Can anyone help me fix my code? Thanks.
I want to draw a picture of a 2D function F(x, y). The idea behind it
is very simple: calculate F(x, y) at every pixel (xi, yi); if F(xi,
yi) <= eps, we color that pixel BLACK.
The code follows below:
startCordx and startCordy are the start point of the search
numPointx and numPointy are the number of points that x and y are divided into
divx and divy are the step sizes
iter holds the returned array
However, my code fails with:
[quote]
File "e:\prg\py\python-2.7.3\lib\site-packages\pycuda\driver.py",
line 377, in function_call Context.synchronize()
pycuda._driver.LaunchError: cuCtxSynchronize failed: launch failed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: launch failed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuModuleUnload failed: launch failed
[/quote]
[python code]
#coding=utf-8
import PIL.Image as Image
import time
import numpy
import PIL.ImageOps as ImageOps
import pycuda.gpuarray as gpuarray
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
eps= 5
divX=divY=0.002
startCordX, startCordY=-5, -5
endCordX, endCordY=5, 6
numPointX=int((endCordX-startCordX)/divX+1)
numPointY=int((endCordY-startCordY)/divY+1)
# allocate a numpy array
iter = numpy.ones((numPointX, numPointY)).astype(numpy.uint8)*0xff
mod = SourceModule("""
__global__ void multiply_them(
int startCordx, int startCordy,
unsigned int numPointx, unsigned int numPointy,
float divx, float divy,
unsigned char eps,
unsigned char *iter)
{
const unsigned int xIdx = threadIdx.x+blockIdx.x*blockDim.x;
const unsigned int yIdx = threadIdx.y+blockIdx.y*blockDim.y;
unsigned int offset=xIdx+yIdx*blockDim.x*gridDim.x;
float x=startCordx + xIdx * divx;
float y=startCordy + yIdx * divy;
if ((xIdx<numPointx)&&(yIdx<numPointy))
{
if (
abs((17*x*x-16*abs(x)*y+17*y*y-255))<=eps
)
{
iter[offset]=0;
}
else
{
iter[offset]=255;
}
}
}
""")
multiply_them = mod.get_function("multiply_them")
multiply_them(
numpy.int32(startCordX), numpy.int32(startCordY),
numpy.uint32(numPointX), numpy.uint32(numPointY),
numpy.float32(divX), numpy.float32(divY),
numpy.uint8(eps),
drv.InOut(iter),
grid=((numPoint.x+15)//16,(numPoint.y+15)//16,),
block=(16,16,1)
)
endTime=time.time()
print 'Time used: %.2f seconds' % (endTime-startTime)
img=Image.fromarray(iter, mode='L')
img=ImageOps.flip(img)
img.show()
[/python code]
Dear Ahmed
Thank you for the resourceful reply.
But the CUFFT limit is ~2^27, and the benchmarks on the CUFFT site reach up to 2^25. In my case, I can only reach 2^24, so I must be missing another factor. Is this limited by my GPU's memory?
Also, in the same table, "Maximum width and height for a 2D texture reference bound to a CUDA array" is 65000*65000, which is far higher than what I can reach. My GPU has compute capability 2.x.
Thank you for the idea of performing two separate sequential 1D FFTs. I will look into it. However, my problem doesn't stop at 2D: my goal is to perform a 3D FFT, and I am not sure whether I can calculate it that way.
For others in the list, here I am sending the complete traceback of the error message.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 493, in runfile
    execfile(filename, namespace)
  File "/home/jayanth/Dropbox/fft/fft1d_AB.py", line 99, in <module>
    plan.execute(gpu_data)
  File "/usr/local/lib/python2.7/dist-packages/pyfft-0.3.8-py2.7.egg/pyfft/plan.py", line 271, in _executeInterleaved
    batch, data_in, data_out)
  File "/usr/local/lib/python2.7/dist-packages/pyfft-0.3.8-py2.7.egg/pyfft/plan.py", line 192, in _execute
    self._tempmemobj = self._context.allocate(buffer_size * 2)
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
Also, here is the simple program I was referring to, which calculates the FFT using pyfft:
from pyfft.cuda import Plan
import numpy
import pycuda.driver as cuda
from pycuda.tools import make_default_context
import pycuda.gpuarray as gpuarray
cuda.init()
context = make_default_context()
stream = cuda.Stream()
plan = Plan((4096, 4096), stream=stream) #creating the plan
data = numpy.ones((4096, 4096), dtype = numpy.complex64) #My data with just ones to calculate the fft for single precision
gpu_data = gpuarray.to_gpu(data) #converting to gpu array
plan.execute(gpu_data)#calculating pyfft
result = gpu_data.get() #the result
This is just a simple program to calculate the FFT of a 4096 x 4096 array in 2D. It works well for this array or a smaller one. As soon as I increase the size to higher values such as 8192*8192 or 8192*4096, it gives an error message saying out of memory.
So I wanted to know the reason behind it and how to overcome it.
You can execute the same code; kindly let me know if you hit the same limits on your respective GPUs.
Thank you
Date: Thu, 5 Dec 2013 20:27:45 -0500
Subject: Re: [PyCUDA] cuMemAlloc failed: out of memory
From: wuzzyview(a)gmail.com
To: cv.jayanth(a)hotmail.com
CC: pycuda(a)tiker.net
I ran into a similar issue: http://stackoverflow.com/questions/13187443/nvidia-cufft-limit-on-sizes-and…
The long and short of it is that CUFFT seems to have a limit of approximately 2^27 elements that it can operate on, in any combination of dimensions. In the StackOverflow post above, I was trying to make a plan for large batches of the same 1D FFTs and hit this limitation. You'll also notice that the benchmarks on the CUFFT site https://developer.nvidia.com/cuFFT go up to sizes of 2^25.
I hypothesize that this is related to the 2^27 "Maximum width for a 1D texture reference bound to linear memory" limit that we see in Table 12 of the CUDA C Programming Guide http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities.
So since 4096**2 is 2^24, increasing to 8096 by 8096 gets very close to the limit, even though you'd think 2D FFTs would not be governed by the same limits as 1D FFT batches.
You should be able to achieve 8096 by 8096 and larger 2D FFTs by performing two separate sequential 1D FFTs, one horizontal and the other vertical. The runtimes should nominally be the same (they are for CPU FFTs), and the answer will be the same, up to machine precision.
On Thu, Dec 5, 2013 at 9:53 AM, Jayanth Channagiri <cv.jayanth(a)hotmail.com> wrote:
Hello
I have an NVIDIA Quadro 2000 GPU. It has 192 CUDA cores and 1 GB of GDDR5 memory.
I am trying to calculate an FFT on the GPU using pyfft.
I am able to calculate the FFT only for arrays up to 4096 x 4096.
As soon as I increase the array size, I get this error message:
pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
Can anyone please tell me whether this error means that my GPU's memory is not sufficient for this array? Or is it my computer's memory? Or a programming error? What is the maximum array size you can achieve on a GPU?
Is there any information on how else I could handle such huge arrays?
Thank you very much in advance for the help, and sorry if this is too basic a question.
Jayanth
_______________________________________________
PyCUDA mailing list
PyCUDA(a)tiker.net
http://lists.tiker.net/listinfo/pycuda