Hi all,
I am looking for a way to exchange some tensors between two GPUs
that do not have P2P enabled. Assuming both GPUs are on the same node, I guess I
have to do it in two steps: first copy from the source GPU (gpu-0) to host
memory, and then copy from host memory to the other GPU (gpu-1). However, it is
not clear to me how to go about this.
Any help is appreciated.
Thanks
- Baskaran
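
The two-step route described above could be sketched with PyCUDA roughly as follows. This is a minimal sketch under stated assumptions, not a tested recipe: it assumes PyCUDA and NumPy are installed and that the two GPUs are devices 0 and 1, and the function name copy_between_gpus is made up for illustration. Each device gets its own context; the array is pulled into host memory with get() while the first context is current, and pushed to the second device with to_gpu() once the second context is current.

```python
def copy_between_gpus(shape=(4, 4), src_dev=0, dst_dev=1):
    """Stage a float32 array from one GPU to another through host memory.

    Hypothetical helper for illustration; returns True if the data
    arrives unchanged on the destination device.
    """
    # Imported lazily so this file can be loaded on a machine without CUDA.
    import numpy as np
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray

    cuda.init()

    # Step 1: source device -> host memory.
    ctx_src = cuda.Device(src_dev).make_context()  # becomes the current context
    try:
        src = gpuarray.to_gpu(np.random.randn(*shape).astype(np.float32))
        host = src.get()   # device-to-host copy into a NumPy array
        del src            # release device memory while its context is current
    finally:
        ctx_src.pop()

    # Step 2: host memory -> destination device.
    ctx_dst = cuda.Device(dst_dev).make_context()
    try:
        dst = gpuarray.to_gpu(host)                 # host-to-device copy
        ok = bool(np.array_equal(dst.get(), host))  # round-trip check
        del dst
    finally:
        ctx_dst.pop()

    return ok
```

On a machine with two GPUs, calling copy_between_gpus() should return True if the round trip preserved the data. In a real program the contexts would normally stay alive for the whole run rather than being created per transfer.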

I am trying to learn from the example at
http://wiki.tiker.net/PyCuda/Examples/MatrixmulSimple and it's working
so far, but only for smaller matrices. When I increase the size of
the matrix, the CPU and GPU values diverge by as much as 5.9e+01.
I suspect it's due to the block and grid parameters I need to pass to
matrixmul(). Is that correct? How can I pick optimal values?
Or is there something else I should be considering?
My matrix size is 10000x3.

Received from Keith Brown on Sun, Nov 08, 2015 at 11:46:47PM EST:
> Thanks Lev.
> My matrix size is going to be large, somewhere near n=100000.
(I assume n = total number of elements in the matrix; a matrix of 10**5 x
10**5 32-bit floating point values would need about 40 GB, more memory than
currently available GPUs can provide.)
> So, how can I test between CPU and GPU matrix math? I thought my
> technique was good enough but apparently not.
If you are trying to make the CPU and GPU perform floating point computations
that are as similar as possible, you may want to look into the single precision
intrinsics that CUDA provides to control rounding during addition and
multiplication (e.g., __fadd_rd, __fadd_rn, etc.), as well as compiler options
that affect processing of denormals (e.g., --ftz). For the purposes of checking
algorithmic correctness against an existing (CPU-based) implementation, you may
want to use double precision (even if you plan to use single precision for your
actual computations). In general, though, it is prudent to test results (via
allclose()) with some defined tolerance in light of the effects of floating
point operations.
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/
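
The advice above (compare against a double precision reference, with a defined tolerance) can be illustrated without a GPU. The sketch below is only a stand-in that uses the Python standard library: it emulates single precision by rounding through struct, and is_close applies the same per-element test that numpy.allclose uses, |value - reference| <= atol + rtol * |reference|. The helper names to_f32, dot_f32, dot_f64 and is_close are made up for this example; with NumPy one would call np.allclose directly.

```python
import math
import random
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_f32(a, b):
    """Dot product with every product and partial sum rounded to float32."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(to_f32(x) * to_f32(y)))
    return acc


def dot_f64(a, b):
    """Double precision reference; math.fsum avoids accumulation error."""
    return math.fsum(x * y for x, y in zip(a, b))


def is_close(value, reference, rtol=1e-5, atol=1e-8):
    """Same test numpy.allclose applies to each element (same defaults)."""
    return abs(value - reference) <= atol + rtol * abs(reference)


random.seed(0)
n = 10000
a = [to_f32(random.gauss(0.0, 1.0)) for _ in range(n)]
b = [to_f32(random.gauss(0.0, 1.0)) for _ in range(n)]

single = dot_f32(a, b)
double = dot_f64(a, b)
print("float32 result:", single)
print("float64 result:", double)
print("absolute difference:", abs(single - double))
print("close with rtol=1e-4:", is_close(single, double, rtol=1e-4))
```

Because the rounding error of a sum grows with the magnitude of the partial sums, a relative tolerance (rtol) usually scales better to large matrices than an absolute tolerance alone.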

Received from Keith Brown on Sun, Nov 08, 2015 at 10:23:18PM EST:
> I have several thousand matrices where I need to calculate their dot
> product. So, it seems pyCuda should do the trick (i hope). I am
> running into an issue with block sizes.
>
> Here is my code
>
> #!/usr/bin/env python
> import sys,time
> from string import Template
> import numpy as np
> from pycuda import driver, compiler, gpuarray, tools
> from pycuda.compiler import SourceModule
> import pycuda.autoinit
>
>
> def main():
>     d={}
>     size=4
>     d['size']=size
>
>     src=Template("""
>     __global__ void MatrixMulKernel(float *a, float *b, float *c)
>     {
>         int tx = threadIdx.x;
>         int ty = threadIdx.y;
>         float Pvalue = 0;
>
>         for (int k = 0; k < $size; ++k) {
>             float Aelement = a[ty * $size + k];
>             float Belement = b[k * $size + tx];
>             Pvalue += Aelement * Belement;
>         }
>         c[ty * $size + tx] = Pvalue;
>
>     }
>     """)
>
>     #src.substitute(d)
>
>     a_cpu = np.random.randn(size,size).astype(np.float32)
>     b_cpu = np.random.randn(size,size).astype(np.float32)
>
>     a_gpu=gpuarray.to_gpu(a_cpu)
>     b_gpu=gpuarray.to_gpu(b_cpu)
>     c_gpu = gpuarray.empty((size,size), np.float32)
>
>     src.substitute(d)
>     mod = compiler.SourceModule(src.substitute(d))
>     mm=mod.get_function("MatrixMulKernel")
>     v=mm(a_gpu,b_gpu,c_gpu,
>          block=(16,16,1),
>          )
>     start=time.time()
>
>     gpu_ans=c_gpu.get()
>     stop=time.time()
>     print "Gpu",stop-start
>
>     start=time.time()
>     cpu_ans=np.dot(a_cpu,b_cpu)
>     stop=time.time()
>     print "Cpu",stop-start
>
>
>     #print gpu_ans
>     #print cpu_ans
>     print np.allclose(gpu_ans,cpu_ans,atol=1e-03)
>
>
> Couple of issues:
> When I increase size of matrix it seems it gets less accurate than CPU
> dot product therefore I have to use np.allclose to compare.
It isn't necessarily clear that the CPU answer is "more accurate"; since the
summations performed on the GPU may occur in a different order than those on the
CPU and since floating point addition is not associative, the difference between
the GPU and CPU results may become more pronounced for the larger summations
required when computing the dot product of large matrices.
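
The effect of summation order described above can be reproduced in plain Python, with no GPU involved. The values below are chosen purely for illustration, and sum_in_order is a made-up helper name; math.fsum serves as an exactly rounded reference.

```python
import math

# Floating point addition is not associative: grouping changes the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)   # False
print(left, right)     # 0.6000000000000001 0.6


def sum_in_order(values):
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# The same four numbers summed in two different orders.
values = [1e16, 1.0, -1e16, 1.0]
forward = sum_in_order(values)             # 1.0
backward = sum_in_order(reversed(values))  # 0.0
exact = math.fsum(values)                  # 2.0 (correctly rounded sum)
print(forward, backward, exact)
```

Neither ordering is "the" correct one here; both differ from the exactly rounded sum, which is why a tolerance-based comparison is the usual way to check a GPU result against a CPU result.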
> Also, what is the optimal block size I should be using?
It depends on your matrix size; you generally want to set the block (and grid)
size to maximize the number of threads active at a specific time.
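
As a rough illustration of how block and grid sizes relate to the matrix size: the kernel as posted indexes only by threadIdx and the launch passes no grid argument, so it appears to cover at most one block-sized tile of the output. A common remedy (an assumption here, not something stated in this thread) is to compute global indices as blockIdx * blockDim + threadIdx, guard them with a bounds check, and launch enough blocks to cover the whole matrix. The plain-Python sketch below only computes the launch dimensions and emulates the indexing; launch_dims and count_covered are hypothetical helper names.

```python
def launch_dims(rows, cols, block=(16, 16)):
    """Return (block, grid) tuples covering a rows x cols output matrix.

    x indexes columns and y indexes rows, matching c[ty * size + tx].
    """
    bx, by = block
    grid_x = (cols + bx - 1) // bx   # ceil(cols / bx)
    grid_y = (rows + by - 1) // by   # ceil(rows / by)
    return (bx, by, 1), (grid_x, grid_y)


def count_covered(rows, cols, block, grid):
    """Emulate the thread indexing and count in-bounds output elements."""
    bx, by, _ = block
    gx, gy = grid
    seen = set()
    for block_y in range(gy):
        for block_x in range(gx):
            for ty in range(by):
                for tx in range(bx):
                    row = block_y * by + ty   # blockIdx.y * blockDim.y + threadIdx.y
                    col = block_x * bx + tx   # blockIdx.x * blockDim.x + threadIdx.x
                    if row < rows and col < cols:   # bounds check
                        seen.add((row, col))
    return len(seen)


block, grid = launch_dims(100, 100)
print(block, grid)                                       # (16, 16, 1) (7, 7)
print(count_covered(100, 100, block, grid) == 100 * 100) # True
```

In PyCUDA the two tuples would be passed as the block= and grid= keyword arguments of the kernel call. Whether 16 x 16 is the best block shape depends on the device and the kernel, so it is worth timing a few alternatives.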
If your matrices are very small (4 x 4), it isn't clear that using the GPU will
save you much time compared to using numpy because of the cost of copying the
matrices to and from GPU memory.
Note that if you are dealing with large matrices, you may wish to check out the
CUBLAS functions for matrix multiplication; a dot() function that uses those
functions is available in scikit-cuda [1], although the Python code that makes
the function easy to use may impose some noticeable overhead if you plan to
invoke it several thousand times.
[1] http://scikit-cuda.rtfd.org
--
Lev Givon
Bionet Group | Neurokernel Project
http://lebedov.github.io/
http://neurokernel.github.io/

I have several thousand matrices where I need to calculate their dot
product. So, it seems pyCuda should do the trick (i hope). I am
running into an issue with block sizes.
Here is my code
#!/usr/bin/env python
import sys,time
from string import Template
import numpy as np
from pycuda import driver, compiler, gpuarray, tools
from pycuda.compiler import SourceModule
import pycuda.autoinit
def main():
    d={}
    size=4
    d['size']=size
    src=Template("""
    __global__ void MatrixMulKernel(float *a, float *b, float *c)
    {
        int tx = threadIdx.x;
        int ty = threadIdx.y;
        float Pvalue = 0;
        for (int k = 0; k < $size; ++k) {
            float Aelement = a[ty * $size + k];
            float Belement = b[k * $size + tx];
            Pvalue += Aelement * Belement;
        }
        c[ty * $size + tx] = Pvalue;
    }
    """)
    #src.substitute(d)
    a_cpu = np.random.randn(size,size).astype(np.float32)
    b_cpu = np.random.randn(size,size).astype(np.float32)
    a_gpu=gpuarray.to_gpu(a_cpu)
    b_gpu=gpuarray.to_gpu(b_cpu)
    c_gpu = gpuarray.empty((size,size), np.float32)
    src.substitute(d)
    mod = compiler.SourceModule(src.substitute(d))
    mm=mod.get_function("MatrixMulKernel")
    v=mm(a_gpu,b_gpu,c_gpu,
         block=(16,16,1),
         )
    start=time.time()
    gpu_ans=c_gpu.get()
    stop=time.time()
    print "Gpu",stop-start
    start=time.time()
    cpu_ans=np.dot(a_cpu,b_cpu)
    stop=time.time()
    print "Cpu",stop-start
    #print gpu_ans
    #print cpu_ans
    print np.allclose(gpu_ans,cpu_ans,atol=1e-03)
Couple of issues:
When I increase size of matrix it seems it gets less accurate than CPU
dot product therefore I have to use np.allclose to compare.
Also, what is the optimal block size I should be using?

Hi all,
I am running Fedora 22 Linux with an Nvidia GTX 260. Driver 340.93
and CUDA 6.5 are installed, which are the latest versions supported for
this graphics card. I am using Python 2.7.10 and have installed the
full SciPy stack through yum.
I am having trouble with pycuda 2015.1.3 and pycuda.autoinit:
>>> import pycuda.autoinit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/pycuda/autoinit.py", line 2, in <module>
    import pycuda.driver as cuda
  File "/usr/lib64/python2.7/site-packages/pycuda/driver.py", line 5, in <module>
    from pycuda._driver import * # noqa
ImportError: /usr/lib64/python2.7/site-packages/pycuda/_driver.so: undefined symbol: PyUnicodeUCS2_AsWideChar
The error does not occur with pycuda 2015.1.2, which seems to
work fine.
Cheers
GuL
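
The symbol name PyUnicodeUCS2_AsWideChar suggests that this _driver.so was compiled against a narrow (UCS2) Python 2 build; if the interpreter loading it is a wide (UCS4) build, that symbol would not exist. This is a guess based on the error message alone and is not confirmed in this thread. A quick way to see which kind of interpreter is running is to look at sys.maxunicode; unicode_build is a made-up helper name.

```python
import sys


def unicode_build():
    """Return 'UCS4' for wide-unicode interpreters, 'UCS2' for narrow ones."""
    # Narrow Python 2 builds report 0xFFFF; wide builds report 0x10FFFF.
    # Python 3.3 and later always report 0x10FFFF.
    return "UCS4" if sys.maxunicode > 0xFFFF else "UCS2"


print(unicode_build())
```

If this prints UCS4 under the same Python 2.7 interpreter that raises the ImportError, rebuilding pycuda from source against that interpreter may be worth trying.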