I am having some trouble with my Django/Celery/PyCuda setup. I am using
PyCuda for some image processing on an Amazon EC2 G2 instance. Here is the
deviceQuery output for my CUDA-capable GRID K520 card:
    Detected 1 CUDA Capable device(s)

    Device 0: "GRID K520"
      CUDA Driver Version / Runtime Version          6.0 / 6.0
      CUDA Capability Major/Minor version number:    3.0
      Total amount of global memory:                 4096 MBytes (4294770688 bytes)
      ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
      GPU Clock rate:                                797 MHz (0.80 GHz)
      Memory Clock rate:                             2500 Mhz
      Memory Bus
      L2 Cache Size:                                 524288 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
      Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
      Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Bus ID / PCI location ID:           0 / 3
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 6.0, CUDA Runtime Version = 6.0, NumDevs = 1, Device0 = GRID K520
    Result = PASS
I am using a pretty out-of-the-box Celery config. I have a set of tasks
defined in utils/tasks.py, which were tested and working before I tried to
use PyCuda. I installed PyCuda via pip.
At the top of the file that I am having trouble with, I do my standard
imports:

    from celery import task
    # other imports
    import os
    try:
        import Image
    except Exception:
        from PIL import Image
    import time

    # Cuda imports
    import pycuda.autoinit
    import pycuda.driver as cuda
    from pycuda.compiler import SourceModule
    import numpy
A remote server initiates a task, which follows this basic workflow:
    print 'Got photo...'
    ... Do some stuff ...
    result = do_photo_manipulation(photo_id)
    im = Image.open(inPath)
    px = numpy.array(im)
    px = px.astype(numpy.float32)
    d_px = cuda.mem_alloc(px.nbytes)
    ... (Do stuff with the pixel array) ...
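For scale, the size passed to cuda.mem_alloc can be worked out by hand: a
float32 array takes 4 bytes per element, so px.nbytes is
height * width * channels * 4. The sketch below assumes a hypothetical
1920x1080 RGB photo (the real image size is not stated here) and compares
it with the 4294770688 bytes of global memory that deviceQuery reports. The
helper name float32_nbytes is made up for illustration.

```python
# Hypothetical helper: bytes needed for a float32 pixel array.
FLOAT32_BYTES = 4                  # numpy.float32 uses 4 bytes per element
GLOBAL_MEMORY_BYTES = 4294770688   # global memory from the deviceQuery output


def float32_nbytes(height, width, channels):
    """Return the nbytes value of a float32 array with this shape."""
    return height * width * channels * FLOAT32_BYTES


# Assumed example size; the actual photo dimensions are not given.
needed = float32_nbytes(1080, 1920, 3)
fraction = float(needed) / GLOBAL_MEMORY_BYTES
```

Under that assumption the allocation is roughly 24 MB, well under 1% of the
card's memory, so the request size by itself seems unlikely to explain the
failure.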
This workflow runs correctly in shell plus (i.e., ./manage.py shell_plus)
and as a standalone process outside of Django and Celery. It fails only
when it runs as a Celery task, with the error:

    cuMemAlloc failed: not initialized
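One way to narrow this down may be to check whether the process that
imported the module (and therefore ran pycuda.autoinit) is the same process
that executes the task. Celery's default prefork pool runs tasks in child
processes, so the two can differ. A minimal sketch using only the standard
library follows; IMPORT_PID and process_report are hypothetical names, and
the idea would be to log process_report() from inside the task.

```python
import os

# Recorded once, when this module is first imported.
IMPORT_PID = os.getpid()


def process_report():
    """Compare the importing process with the process calling this function."""
    current_pid = os.getpid()
    return {
        "import_pid": IMPORT_PID,
        "current_pid": current_pid,
        "same_process": current_pid == IMPORT_PID,
    }
```

In a plain Python shell same_process would be True. If it comes back False
inside the Celery task, the import-time initialization happened in a
different process from the one calling cuda.mem_alloc.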
I have looked at other solutions for a while, and tried moving the import
statement that does the initialization into the function itself. I have
also added a wait() call, to rule out the GPU not being ready to do work.
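Moving the initialization into the function can also be done explicitly
rather than through the import. The sketch below is one possible pattern,
not a confirmed fix: it builds a resource lazily and rebuilds it whenever
the calling process ID changes. PerProcessResource and make_context are
hypothetical names. With PyCuda the factory would presumably call
cuda.init() and then cuda.Device(0).make_context(); that part appears only
as a comment so the example runs without a GPU.

```python
import os


class PerProcessResource(object):
    """Lazily build a resource once per process (rebuilt after a fork)."""

    def __init__(self, factory):
        self._factory = factory
        self._owner_pid = None
        self._value = None

    def get(self):
        pid = os.getpid()
        if self._owner_pid != pid:
            # First use in this process: build a fresh resource here.
            self._value = self._factory()
            self._owner_pid = pid
        return self._value


def make_context():
    # Placeholder so the sketch runs anywhere. With PyCuda this might be:
    #     cuda.init()
    #     return cuda.Device(0).make_context()
    return {"created_in_pid": os.getpid()}


gpu_context = PerProcessResource(make_context)
first = gpu_context.get()
second = gpu_context.get()  # same process, so the same object is reused
```

A task would call gpu_context.get() before cuda.mem_alloc. A real context
created this way would also need to be released when the worker is done
with it (for example with its pop() method), which this sketch does not
handle.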
Another answer suggests that this error comes from not importing
pycuda.autoinit, but I am already importing it.
The frustrating thing is that in a standalone Python shell, PyCuda behaves
correctly. It is only in a Celery process that things break down.
Any help here would be appreciated!
If I need to provide any more information, just let me know!