The findings below assume I already have a 20 million * 57 int array on the GPU.

On Jun 6, 2018, at 3:05 AM, aseem hegshetye <aseem.hegshetye@gmail.com> wrote:

Hi,
I did some testing with the number of threads: I varied the thread count and recorded the time in seconds it took for the PyOpenCL kernel to execute.
The results:
  • No_of_threads --- Time in seconds
  • 10,000 --- 202
  • 20,000 --- 170
  • 24,000 --- 209
  • 30,000 --- 224
  • 30,714 --- 659
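A quick way to pick a starting point from measurements like these (a minimal sketch in plain Python, using the numbers above; the sweet spot will of course differ per kernel and per device):

```python
# Measured (thread_count, seconds) pairs from the runs above.
timings = [
    (10_000, 202),
    (20_000, 170),
    (24_000, 209),
    (30_000, 224),
    (30_714, 659),
]

# Pick the thread count with the lowest runtime as the starting point.
best_threads, best_time = min(timings, key=lambda pair: pair[1])
print(best_threads, best_time)  # 20000 170
```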
Thanks
Aseem

On Wed, Jun 6, 2018 at 1:54 AM, Sven Warris <sven@warris.nl> wrote:
Hi Aseem,

This may be caused by memory access collisions and/or a lack of coalesced memory access. This technical report gives some pointers:
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.pdf
Do you use atomic operations? Or maybe you have too many thread fences?
I have no problem starting many threads: the number of threads alone is not the issue.
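If atomics turn out to be the culprit, the usual fix is to replace a single atomically-updated result with per-work-group partial results that are combined in a final pass. A minimal sketch of the idea in plain Python (hypothetical names; on the device this would be per-work-group local memory plus a small second kernel or host-side combine):

```python
data = list(range(1, 101))
n_groups = 4

# Each "work-group" reduces its own chunk independently -> no contention.
chunk = (len(data) + n_groups - 1) // n_groups
partials = [sum(data[g * chunk:(g + 1) * chunk]) for g in range(n_groups)]

# One cheap final pass combines the partial results.
total = sum(partials)
print(total)  # 5050
```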

Cheers,
Sven


Op 6-6-2018 om 8:37 schreef aseem hegshetye:
Hi,
Does GPU speed drop sharply once the number of threads increases beyond a certain point? I used to allocate number of threads = number of transactions in the data under consideration.
For a Tesla K80 I see an exponential drop in speed above 30,290 threads.
If so, is it best practice to keep the number of threads low and iterate over the data to get results at optimum speed?
How do I find the best number of threads for a GPU?
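One common pattern is exactly that: launch a fixed number of work-items and have each one loop over the data with a stride equal to the global size (a grid-stride loop). A minimal pure-Python sketch of the indexing, with hypothetical names `n_threads` and `n_items` (in the kernel, `thread_id` would be `get_global_id(0)`):

```python
def grid_stride_indices(thread_id, n_threads, n_items):
    """Indices one work-item processes in a grid-stride loop."""
    return list(range(thread_id, n_items, n_threads))

# With 4 "threads" over 10 items, thread 1 handles items 1, 5, 9.
print(grid_stride_indices(1, 4, 10))  # [1, 5, 9]

# Every item is covered exactly once across all threads.
covered = sorted(i for t in range(4) for i in grid_stride_indices(t, 4, 10))
assert covered == list(range(10))
```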

Thanks
Aseem


_______________________________________________
PyOpenCL mailing list
PyOpenCL@tiker.net
https://lists.tiker.net/listinfo/pyopencl


