GPUs are good at crunching numbers, but only if you can give them enough
numbers to crunch to hide the latency of memory access. GPUs are only
efficient when your problem exhibits high arithmetic intensity, which can
be loosely defined as the number of arithmetic operations per datum read
or written.
Sadly, a dot product is not such a problem: it performs just one
multiplication and one addition for every two operands read, and the final
sum does not help either, because a reduction cannot be parallelized as
well as an elementwise operation.
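As a rough back-of-the-envelope check (assuming float32 inputs and
counting one multiply plus one add per pair of elements):

n = 3000
bytes_moved = 2 * n * 4      # two float32 vectors, each element read once
flops = 2 * n                # one multiply and one add per element pair
print(flops / bytes_moved)   # 0.25 flops/byte -- memory-bound by a wide margin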
Regards,
Daniel
On Tue, 25 Apr 2017 at 21:50, archana sapkota <archanasapkota(a)gmail.com>
wrote:
> Hello,
> I just started working with PyCUDA; CUDA as a whole is basically new to
> me. I am trying to use the GPU to compute dot products of a large number
> of vectors, because doing so was taking several days on multiple CPU cores.
>
> But on my first try I did not see the speed boost I hoped for. Here is
> the piece of code I am currently running, just to see how much speedup I
> will get. My vectors of interest have a dimension of around 3000, so
> eventually I will be computing the dot product (or L2 norm) of those
> vectors.
>
> I would greatly appreciate it if someone could suggest what I am missing
> and how I could achieve my goal.
>
> I also see some difference between the numpy and GPU results. That is not
> a big concern right now, but I am curious why.
>
> Here is the sample code I am working with:
>
> import pycuda.gpuarray as gpuarray
> import pycuda.reduction as reduction
> import pycuda.driver as cuda
> import pycuda.autoinit
> from pycuda.compiler import SourceModule
> import numpy
> import time
>
>
> krnl = reduction.ReductionKernel(numpy.float32, neutral="0",
>                                  reduce_expr="a+b", map_expr="x[i]*y[i]",
>                                  arguments="float *x, float *y")
> ssd = reduction.ReductionKernel(numpy.float32, neutral="0",
>                                 reduce_expr="a+b",
>                                 map_expr="(x[i] - y[i])*(x[i] - y[i])",
>                                 arguments="float *x, float *y")
>
> for i in range(10):
>     # note: numpy computes in float64 here, while the GPU kernels use float32
>     a = numpy.random.randn(3000)
>     b = numpy.random.randn(3000)
>
>     a_gpu = gpuarray.to_gpu(a.astype(numpy.float32))
>     b_gpu = gpuarray.to_gpu(b.astype(numpy.float32))
>
>     start = time.time()
>     numpy_dot = numpy.dot(a, b)
>     numpy_ssd = numpy.sum((a - b) ** 2)   # squared Euclidean distance on the CPU
>     end = time.time()
>     dt = end - start
>
>     print("CPU time", dt)
>     print("numpy_dot", numpy_dot)
>     print("numpy_ssd", numpy_ssd)
>
>     start = time.time()
>     my_dot_prod = krnl(a_gpu, b_gpu).get()
>     my_ssd = ssd(a_gpu, b_gpu).get()      # squared Euclidean distance on the GPU
>     end = time.time()
>     dt = end - start
>
>     print("GPU time", dt)
>     print("my dot product", my_dot_prod)
>     print("my ssd", my_ssd)
>     print("\n")
>
>
> Example timings are:
> CPU time 5.9604644775390625e-06
> numpy_dot -19.7736554062
> numpy_ssd 5975.41368065
> GPU time 0.0009388923645019531
> my dot product -19.77365493774414
> my ssd 5975.4140625
>
>
> Thanks,
> Arch
>
On Tue, Apr 25, 2017 at 3:49 PM, archana sapkota
<archanasapkota(a)gmail.com> wrote:
Several points:
- The first invocation of a kernel will be slower than subsequent
invocations because of the time taken to compile it.
- Owing to the relatively low bandwidth of GPU-to-host memory transfers,
you will probably not see any overall speedup for vectors as short as
those you are processing if you load a new vector into GPU memory at every
iteration; see the batched sketch after this list. You may well see better
performance processing your vectors in parallel on the CPU using something
like Python's multiprocessing module or dask distributed
(https://distributed.readthedocs.io/en/latest/).
- Since you are using single-precision floats, you will see differences
between the CUDA and numpy results because of internal implementation
differences.
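To make the first two points concrete, here is a minimal sketch (the
batched_dot kernel and the batch size are my own illustration, not part of
PyCUDA) that ships all vectors to the GPU in one transfer, computes every
dot product in a single launch, and does a warm-up call so compilation is
excluded from the timing:

import numpy
import time
import pycuda.autoinit
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void batched_dot(const float *a, const float *b, float *out, int dim)
{
    // One block per vector pair; threads stride over the components.
    __shared__ float partial[256];
    const float *ra = a + blockIdx.x * dim;
    const float *rb = b + blockIdx.x * dim;
    float acc = 0.0f;
    for (int j = threadIdx.x; j < dim; j += blockDim.x)
        acc += ra[j] * rb[j];
    partial[threadIdx.x] = acc;
    __syncthreads();
    // Shared-memory tree reduction (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];
}
""")
batched_dot = mod.get_function("batched_dot")

n, dim = 10000, 3000   # hypothetical batch size; dim matches the poster's vectors
a = numpy.random.randn(n, dim).astype(numpy.float32)
b = numpy.random.randn(n, dim).astype(numpy.float32)

a_gpu = gpuarray.to_gpu(a)   # one big transfer instead of one per vector
b_gpu = gpuarray.to_gpu(b)
out_gpu = gpuarray.empty(n, numpy.float32)

# Warm-up launch so compilation/initialization does not pollute the timing.
batched_dot(a_gpu.gpudata, b_gpu.gpudata, out_gpu.gpudata, numpy.int32(dim),
            block=(256, 1, 1), grid=(n, 1))
cuda.Context.synchronize()

start = time.time()
batched_dot(a_gpu.gpudata, b_gpu.gpudata, out_gpu.gpudata, numpy.int32(dim),
            block=(256, 1, 1), grid=(n, 1))
cuda.Context.synchronize()   # wait for the kernel before reading the clock
print("GPU time for", n, "dot products:", time.time() - start)

# Illustrating the third point: float32 accumulation deviates from a float64
# reference in the low digits, which is the same discrepancy seen above.
ref = numpy.einsum('ij,ij->i', a.astype(numpy.float64), b.astype(numpy.float64))
print("max abs deviation from float64 reference:",
      numpy.max(numpy.abs(out_gpu.get() - ref)))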
--
Lev E. Givon, PhD
http://lebedov.github.io
Thanks, Vedran, for the reply.
But that would mean changing the programming platform, and I want to take
advantage of Python libraries (which is why I chose the PyCUDA API).
Is there any way I can achieve the same thing without changing languages?
Regards
Sahil Gupta
On Tue, Apr 18, 2017 at 7:06 PM, Vedran Miletić <rivanvx(a)gmail.com> wrote:
> You might want to look at:
>
> Swenson, Brian Paul, and George F. Riley. "Simulating large topologies in
> ns-3 using BRITE and CUDA driven global routing." *Proceedings of the 6th
> International ICST Conference on Simulation Tools and Techniques*. ICST
> (Institute for Computer Sciences, Social-Informatics and Telecommunications
> Engineering), 2013.
>
> Regards,
> Vedran
>
> On Wed, 19 Apr 2017 at 00:57, Sahil Gupta <sg5414(a)rit.edu> wrote:
>
>> Hi all,
>> I am new to the PyCUDA API.
>>
>> Here is what I want to do:
>> 1. Download BGP tables from the source website to my cluster node.
>> 2. Process all the data and store it in a database.
>> 3. Run an AS-level pathfinding algorithm to get the paths for a
>> particular destination prefix.
>>
>> It is a networking project under the topic of Internet cartography.
>>
>> I need help with steps 1 and 2.
>> The idea is that the BGP route files should be accessible to the GPU
>> nodes, and so should the MySQL database that stores the data.
>> For each prefix, a GPU device will run the algorithm separately,
>> accessing the database as it goes.
>>
>> Now my questions are:
>> 1. Can I have a shared database for all GPU nodes, and if so, how can I
>> implement that in PyCUDA?
>> 2. Can I have a separate database on each GPU node? If so, how can I
>> implement that in PyCUDA? I would prefer this over option 1 because of
>> the communication overhead of database queries.
>> 3. How can I share files among all GPU nodes in PyCUDA?
>> 4. Can I give shared read access to all GPU nodes in PyCUDA?
>>
>> Do let me know if any point is unclear.
>> Awaiting your reply; I have high hopes for this group's help with my
>> project.
>>
>> Cheers
>> Sahil Gupta
> --
> Vedran Miletić
> vedran.miletic.net
>