Andreas - do not worry, I know what academic life looks like ;-)
I just wanted to be sure that my patch did not disappear
"under the radar".
On Fri, 2010-10-22 at 09:27 -0400, Frédéric Bastien wrote:
I hit a similar problem when porting Theano to a newer version of
CUDA (the one that supports Fermi cards). The problem is not the card, but the
compiler/card combination. What happens is that newer nvcc versions do
more code optimization, and this breaks some old assumptions that many
people made. The new optimization is based on the assumption that
the compiler can reorder operations within a warp if there is no
synchronization barrier. I think this was not a problem on pre-Fermi
cards, as the warp was executed sequentially, but that is not always
true on Fermi cards (I'm not certain about pre-Fermi, but I'm
sure about Fermi). The volatile keyword tells the compiler that
the value can change between thread operations, so it does not
apply the new optimization.
Did I explain clearly enough what happened?
IMO it is more or less correct. I will try to explain the problem
and my road to the solution in more detail.
<put on my Teaching Assistant hat>
As some of you noted, a few GPUArray tests failed on Fermi.
In the beginning I assumed (and wrote about it on this list
on 2010-09-27 21:06) that this was caused by changes
in the IEEE-754 implementation in Fermi devices.
Then jmcarval (sorry if I forgot your name) sent a test case
(2010-09-28 9:29) that showed that there is something more
going on with Fermi. I wanted to make Fermi run the older code
(email from 2010-09-28 23:56) but it was not possible - an sm_2
device cannot run sm_11 code (further explanation in the email
from 2010-10-01 00:00).
I started analysing the code generated by nvcc and noticed that
code for sm_2 devices is optimised more aggressively than
code for sm_11 (Ion). For example, it keeps more variables
in registers instead of reading them from memory (email from
2010-10-01 0:40). But I did not find differences in the code
that performs the reduction, only in the preparatory code.
Then I started analysing the reduction kernels. I noticed that
when there is a large number of threads, the code calls __syncthreads(),
and when the number of threads drops below 32 it stops
synchronising threads. I added synchronisation to the code
(first patch, synchronize.diff).
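The idea behind the first patch can be illustrated with a minimal
sketch. This is not the contents of synchronize.diff and not PyCUDA's
actual reduction kernel; the names reduce_sync, sum_block_sync and
BLOCK_SIZE are hypothetical, and a single block with a power-of-two
size is assumed. The point is only that the barrier is kept for every
step, also when fewer than 32 threads are still adding.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // hypothetical block size; must be a power of two

// Sums BLOCK_SIZE floats per block, with a barrier after every step,
// including the steps where fewer than 32 threads are still active.
__global__ void reduce_sync(const float *in, float *out)
{
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    sdata[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    for (unsigned int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        // Kept outside the "if" so that every thread reaches the barrier.
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: reduces exactly one block of BLOCK_SIZE values.
// Error checking is omitted for brevity.
float sum_block_sync(const float *host_in)
{
    float *d_in, *d_out;
    float result = 0.0f;

    cudaMalloc(&d_in, BLOCK_SIZE * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in, BLOCK_SIZE * sizeof(float),
               cudaMemcpyHostToDevice);

    reduce_sync<<<1, BLOCK_SIZE>>>(d_in, d_out);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The price of this variant is a few extra barriers in the last steps,
where only one warp is doing useful work.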
I must admit that I totally forgot about warps - I was thinking
in terms of blocks and grids and was baffled by groups of 32 threads.
Then I went back to the documentation, noticed that there are documents
about Fermi, and found this "volatile" in chapter 1.2.2; voilà,
patch volatile.diff from the same email.
Explanation of why it works (at least according to my knowledge;
I do not know the intimate hardware details):
In the Tesla architecture, each SM had 8 cores, so one warp was
executed in 4 steps. There was also a limited number of registers,
so the compiler generated code that made sure that everything
was held in memory, so that there were registers ready to be used.
Fermi (2.x capabilities) offers a cache and more registers
(32k registers vs. 8k in 1.1 devices). This way the compiler need
not worry about registers. So it optimises the code, keeps values
in registers, and flushes the cache and registers only on a thread barrier
or on demand. This leads to faster code, but also to a difference between
what is in a register and what is in memory.
This is important on Fermi, as it has 32 cores per SM and can execute
half-warps in a different order, disregarding inter-thread dependencies.
The situation gets even more interesting on 2.1 devices, which have 48 cores
per SM.
The keyword "volatile" tells the compiler that the value held in a variable
may be changed by some mechanism that the program does not control.
In the "old times" it was used in programming hardware devices; hardware
registers were declared as "volatile" variables, as their values were
influenced by an external device.
Basically "volatile" prevents the compiler from using the optimisations
described two paragraphs above. Each time the program reads the variable, it
makes sure that it reads it from memory, not from a cache or register.
Each time the program writes the variable, it writes it to memory, not to a cache
or register. This way any other thread will read the proper, recently
calculated value, not the old value.
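As an illustration, here is a minimal sketch of the "volatile"
approach. It is not the contents of volatile.diff and not PyCUDA's
actual kernel; the names reduce_volatile, sum_block_volatile and
BLOCK_SIZE are hypothetical, and a single block with a power-of-two
size of at least 64 is assumed. The last steps run without barriers,
but go through a volatile pointer, so every read and write really
touches shared memory. Note that this still relies on the threads of
one warp running in lockstep, as the Fermi-era documentation assumed;
newer architectures need explicit warp-level synchronisation as well.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // hypothetical; power of two, at least 64

// Sums BLOCK_SIZE floats per block. Barriers are used while more than
// one warp is active; the final warp works through a volatile pointer.
__global__ void reduce_volatile(const float *in, float *out)
{
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid = threadIdx.x;

    sdata[tid] = in[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    for (unsigned int s = BLOCK_SIZE / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32) {
        // volatile: the compiler may not keep these values in registers
        // between steps, so each step sees its neighbours' fresh sums.
        volatile float *v = sdata;
        v[tid] += v[tid + 32];
        v[tid] += v[tid + 16];
        v[tid] += v[tid + 8];
        v[tid] += v[tid + 4];
        v[tid] += v[tid + 2];
        v[tid] += v[tid + 1];
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: reduces exactly one block of BLOCK_SIZE values.
// Error checking is omitted for brevity.
float sum_block_volatile(const float *host_in)
{
    float *d_in, *d_out;
    float result = 0.0f;

    cudaMalloc(&d_in, BLOCK_SIZE * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in, BLOCK_SIZE * sizeof(float),
               cudaMemcpyHostToDevice);

    reduce_volatile<<<1, BLOCK_SIZE>>>(d_in, d_out);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Without the volatile pointer, the compiler is free to keep sdata[tid]
in a register across the six additions, which is exactly the stale
value problem described above.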
Maybe it is time to teach students again about "volatile", "register"
and other types of variables in Introduction to Programming?
If you want to learn more about cache coherency, and so on,
watch Mary Jane Irwin's lecture about caches in multicore CPUs,
which she gave during the ACM meeting when she received the Athena Award:
BTW I have a GTX460 - Fermi with 2.1 CC. Not to bore you with details:
now I have a new GPU, my brother has my previous card instead of his
old broken one, and everyone's computer is working. It was rather
convenient timing, but I am not complaining - I got a new GPU!
I have also checked both patches (volatile.diff and synchronize.diff)
and both of them work.
I have checked that in 0.92.2 generated PTX uses
to fetch values from memory in the final reduction warp.
It also uses more registers (48 instead of 43) in the entire
I hope that it explains what is going on, what has changed in Fermi,
and how it was corrected in PyCUDA.
Tomasz Rybak <bogomips(a)post.pl> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860