Attached is a rather contrived example of computing FFTs with pyfft. The
file names say it all: serial.py, streams.py, streams-time.py.
I have tried to make an example that uses streams. However, I'm not
convinced that it actually works as expected. If you can convince me
that it works, or improve the code, I promise to clean it up and put it
on the wiki.
1: The GPU time width plot in the Compute Visual Profiler does not show
any overlap between streams 1 and 2. From what I have read on the CUDA
forums, this may be caused by the profiler itself, so that running the
code outside the profiler would not give the same behaviour. Is this
true? (And how do you profile your code if the profiler is broken?)
2: The streamed version runs faster than the serial version. However,
I have a nagging suspicion that this speedup comes only from faster
mem-copies and not from any overlap between streams. E.g., printing
the time after each line shows that the "python time" (the wall-clock
time measured in Python around each call) is ~0.3 ms for the first
mem-copy, ~6 ms for the first fft call and ~0.1 ms for the second fft
call. 6 ms happens to be the time of the mem-copy according to the
visual profiler!? Can anyone confirm this: is the first fft call
blocking until the data has been copied to the device? (Also,
get_async seems to be blocking according to the "python time".)
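
To make the two questions above concrete, here is a minimal sketch of how the timings could be taken outside the profiler. It is an illustration, not the code from the attached scripts. The helper names time_step and gpu_elapsed_ms are made up for this example, and the drv parameter is only there so the function can be tried without a GPU. The PyCUDA calls assumed are pycuda.driver.Event (record, synchronize, time_till) and Stream.synchronize. The first helper separates the "python time" of a call from the time until the queued work has actually finished; the second measures GPU-side elapsed time with CUDA events.

```python
import time


def time_step(fn, sync):
    """Return (call_ms, done_ms) for one step.

    call_ms: wall time until fn() returns to Python (the "python time").
    done_ms: wall time until sync() reports that the queued work finished.
    """
    t0 = time.perf_counter()
    fn()
    call_ms = (time.perf_counter() - t0) * 1e3
    sync()
    done_ms = (time.perf_counter() - t0) * 1e3
    return call_ms, done_ms


def gpu_elapsed_ms(enqueue, streams, drv=None):
    """GPU-side time in ms for everything enqueue() queues on `streams`.

    `drv` defaults to pycuda.driver; passing it in is an assumption made
    only so the function can be exercised with a stand-in module.
    """
    if drv is None:
        import pycuda.driver as drv  # needs an active CUDA context

    start, stop = drv.Event(), drv.Event()
    start.record()            # recorded on the default stream
    enqueue()                 # queue copies and kernels on the streams
    for s in streams:
        s.synchronize()       # wait until each stream has drained
    stop.record()
    stop.synchronize()        # make sure the stop event has been reached
    return start.time_till(stop)
```

With the attached scripts, fn would be the single call being timed (a mem-copy or an fft call) and sync would be the synchronize method of the stream it was queued on. If call_ms for the first fft call is close to done_ms of the preceding copy, that would suggest the call is waiting for the copy rather than returning immediately. Comparing gpu_elapsed_ms for the serial and the streamed version gives a profiler-independent number, though it does not by itself show whether the gain comes from overlap or from faster copies.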
Any help appreciated.
School of Computer Science, Physics and Mathematics