Hi PyCUDA community and author, thanks for your continued fantastic support. Today I ran into an unexpected problem when trying to bind a large chunk of device memory to a texture. I've reduced the problem to the following stand-alone example. Apologies that it still carries some specifics of my use case (float2 data, 2D texture; the end use of all this is matrix filtering):

### code starts
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule

# borrowed from PyCUDA test
mod = SourceModule("""
        #define CHANNELS 2
        texture<float2, 2, cudaReadModeElementType> mtx_tex;
        __global__ void copy_texture(float *dest)
        {
          int row = threadIdx.x;
          int col = threadIdx.y;
          int w = blockDim.y;
          float2 texval = tex2D(mtx_tex, row, col);
          dest[(row*w+col)*CHANNELS + 0] = texval.x;
          dest[(row*w+col)*CHANNELS + 1] = texval.y;
        }
        """)
mtx_tex = mod.get_texref("mtx_tex")

width = 32*1024
height = 4330
nbytes = width * height * 2 * 4 # 2 floats per float2, 4 bytes per float
gpu = cuda.mem_alloc(nbytes)
HACK = 3
mtx_tex.set_address(gpu, nbytes / HACK)

### code ends

The above code runs on my Tesla C2050 for values of "HACK" >= 3, that is, when set_address() is asked to bind only one-third (or less) of the allocated device memory to the texture reference. I need it to work for HACK=1, but then I get an error:

### error begins
Traceback (most recent call last):
  File "texbind.py", line 26, in <module>
    mtx_tex.set_address(gpu, nbytes / HACK)
pycuda._driver.LogicError: cuTexRefSetAddress failed: invalid value
### error ends
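
For reference, the sizes involved work out as follows. This is plain Python arithmetic with no GPU needed; the helper name bound_bytes is mine, purely for illustration, and it uses integer division to mirror what nbytes / HACK gives for integers under Python 2.

```python
width = 32 * 1024
height = 4330
bytes_per_texel = 2 * 4  # float2: 2 floats per texel, 4 bytes per float
nbytes = width * height * bytes_per_texel


def bound_bytes(hack):
    # Number of bytes handed to set_address() for a given HACK value
    return nbytes // hack


for hack in (1, 2, 3):
    b = bound_bytes(hack)
    print("HACK=%d: %d bytes (%.2f MiB), %d whole float2 texels"
          % (hack, b, b / 2.0 ** 20, b // bytes_per_texel))
```

So the full allocation is a little over 1 GiB, and the HACK=3 case that works binds roughly a third of that.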

I have verified (using the deviceQuery SDK demo) that these sizes are not too large for this C2050. I also note that I seem to be able to bind the texture to freshly copied data, i.e., I can successfully run the following without error after the above:

### code starts
import numpy
data = (numpy.ones((height,width)) + 1j * numpy.ones((height,width))).astype(numpy.complex64)

carr = cuda.make_multichannel_2d_array(numpy.asarray(numpy.concatenate(
    (data.real[:,:,numpy.newaxis], data.imag[:,:,numpy.newaxis]),2),
    order='C'), order='C')
cuda.bind_array_to_texref(carr, mtx_tex)
### code ends

The above is not the prettiest way to load a complex array from the host into a 2D float2 texture, but it produces no errors and should, in theory, allocate the same amount of memory as the first code example (NB: I haven't yet verified the texture's contents with a kernel).
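
For what it's worth, a possibly tidier way to get the same interleaved layout might be to reinterpret the complex64 buffer as float32, since each complex64 element is stored as a (real, imag) pair of float32 values. The sketch below is NumPy-only, uses a small stand-in shape (h and w are made up for illustration), and just checks that the view matches the concatenate version. I have not tried passing the result to make_multichannel_2d_array.

```python
import numpy

h, w = 4, 6  # small stand-in for (height, width)
data = (numpy.arange(h * w).reshape(h, w)
        + 1j * numpy.ones((h, w))).astype(numpy.complex64)

# Approach from the post: stack real and imaginary parts along a new last axis
stacked = numpy.asarray(numpy.concatenate(
    (data.real[:, :, numpy.newaxis], data.imag[:, :, numpy.newaxis]), 2),
    order='C')

# Alternative: view the same C-contiguous buffer as float32 pairs (no copy)
viewed = data.view(numpy.float32).reshape(h, w, 2)

assert viewed.shape == stacked.shape
assert numpy.array_equal(viewed, stacked)
```

The view relies on data being C-contiguous, which is the case here because astype() returns a fresh C-ordered array.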

Any hints as to what I'm doing wrong?