The classic semantics of C/C++ volatile are: “Please Mr. Compiler do exactly the reads and writes I specify.” Consider the following program:
int x = 0;
void foo( int c )
{
    while( --c )
        x += c;
}
In this case x is just a regular integer, so the compiler is free to allocate it to a register, and it generates this:
sub ecx, 1
je end ; zero-trip loop
mov eax, DWORD PTR x
looptop:
add eax, ecx
sub ecx, 1
jne looptop
mov DWORD PTR x, eax
end:
Notice how x is loaded before the loop begins, allocated to the register eax, and has its final value stored back to memory after the loop. Now let’s see what happens when x is qualified as volatile:
sub ecx, 1
je end ; zero-trip loop
looptop:
add DWORD PTR x, ecx ; add-to-memory
sub ecx, 1
jne looptop
end:
We are now doing a so-called Read-Modify-Write instruction inside the loop body. The add is being performed to (and from) memory.
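To make the difference concrete, here is a rough C++ rendering of what the two listings above do. This is only a sketch: the names foo_plain, foo_hoisted, foo_volatile and the x_* globals are made up for illustration, and foo_hoisted is a hand-written imitation of the register-allocated assembly, not compiler output. All three assume c >= 1.

```cpp
// Illustrative names only; none of these appear in the original listings.
int x_plain = 0;
int x_hoisted = 0;
volatile int x_volatile = 0;

// The original loop on an ordinary int. The compiler may keep x_plain
// in a register for the whole loop.
void foo_plain(int c) {
    while (--c)
        x_plain += c;
}

// Hand-written imitation of the first assembly listing:
// one load before the loop, one store after it.
void foo_hoisted(int c) {
    if (--c == 0)
        return;                  // zero-trip loop
    int eax = x_hoisted;         // single load (mov eax, x)
    do {
        eax += c;                // add eax, ecx
    } while (--c);
    x_hoisted = eax;             // single store (mov x, eax)
}

// With volatile, every iteration must read and write memory.
// The read and write are spelled out rather than using +=.
void foo_volatile(int c) {
    while (--c)
        x_volatile = x_volatile + c;
}
```

Calling each function with the same c leaves the same final value in its global; only the number of memory accesses along the way differs.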
In Visual Studio 2005 the semantics of the volatile qualifier were expanded to reflect modern use of the keyword in multi-threaded code (for an example, see the Wikipedia entry for Double-Checked Locking). This can lead to some surprising behavior:
volatile unsigned x;
unsigned y;
void foo( int c )
{
    while( --c )
        x ^= y;
}
The previous version of Visual Studio (2003) will generate this:
mov eax, DWORD PTR c
dec eax
je end ; zero-trip loop
mov ecx, DWORD PTR y
looptop:
mov edx, DWORD PTR x
xor edx, ecx
dec eax
mov DWORD PTR x, edx
jne looptop
end:
Since x is volatile, it is loaded and stored inside the loop body. The other global, y, is enregistered outside of the loop. This is very sensible. Look what happens when we use VS 2005:
mov eax, DWORD PTR c
sub eax, 1
je end
looptop:
mov ecx, DWORD PTR y
xor DWORD PTR x, ecx
sub eax, 1
jne looptop
end:
VS 2005 generates the xor-to-memory for x, which is probably an improvement over the old discrete load and store. More important is what has happened to y. It is now repeatedly loaded from memory inside the loop, despite being totally non-volatile!
There is a long, complex explanation behind this, which I will save for another day. For now, just be aware that the volatile qualifier in VS 2005 has extra-standard behavior which can impair the compiler’s ability to optimize your code.
Use it with caution.
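If the repeated load of y matters to you, one possible workaround is to copy it into a local variable yourself before the loop. This is a minimal sketch, assuming y really is not modified by anyone else while the loop runs; foo_cached and y_local are hypothetical names, and I am not making any promise about the exact code a particular compiler version will emit for it.

```cpp
volatile unsigned x;
unsigned y;

// Assumes c >= 1 and that y does not change during the loop.
void foo_cached(int c) {
    const unsigned y_local = y;   // read the non-volatile global once
    while (--c)
        x = x ^ y_local;          // x is still read and written every iteration
}
```

Because y_local is a local whose address is never taken, the compiler is free to keep it in a register no matter how the accesses to x are ordered. The volatile accesses to x are unchanged.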

Although I am a huge fan of the RISC concept, registers really are overrated. Since a register is just an explicitly named level-0 data cache, it follows that effective addresses into memory should be just as fast as a register hit (ignoring logic propagation delays, obviously; I’m not THAT dense!). It only takes a 2-read-1-write data cache interface, which can be effectively emulated by a cache driven by a CPU clock twice as fast as the execution core’s pipeline.
Yes, it is more transistors than a normal register array. But, then, the Intel philosophy is to just throw more transistors at things anyway. For all we know, Intel and AMD processors are already doing precisely this, which is likely why your compilers elected to perform “to-memory” operations instead of “to-register” operations.
Gosh, where do I begin?
A register and a cache are different.
You have to ask a cache, “Do you currently have the value of memory location X?” This is called a probe and it takes time. The answer can come back “no” (miss) or “yes” (hit). Modern CPUs require something on the order of 3 cycles to access their first level cache.
There are no register file “hits” or “misses”. The code asks for register #10, you go get it. That’s it. It’s always there. Register files can be read/written multiple times in a single clock cycle. A register-to-register move, for example, takes exactly 1 cycle.
It’s not an accident that today’s CPUs have synchronous L1 caches. It’s not because hardware designers didn’t ever think, “Hey, what if we made the cache run twice as fast?”
We make everything run as fast as it possibly can. This is measured in picoseconds, which is an absolute measurement. Clock cycles are relative. At 3 GHz a clock is roughly 330 picoseconds. So on my hardware, an ADD takes 330 picoseconds. An L1 cache hit takes 990.
You can slow your execution units down if you want. You can perform the ADD in 2790 picoseconds, and then claim your cache runs “twice as fast.” I won’t be impressed until you can do a cache probe in less than 990 picoseconds.
–Mark
PS: The “to-memory” operations above are due to the keyword “volatile” from the C language. http://en.wikipedia.org/wiki/Volatile_variable
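The cycle-versus-picosecond arithmetic in the reply above is easy to check. Here is a small sketch; cycles_to_picoseconds is a made-up helper, and the 3-cycle L1 latency is simply the figure quoted in the reply. At 3 GHz one cycle works out to about 333 picoseconds (rounded to 330 above), so a 3-cycle L1 hit is about 1000.

```cpp
// Hypothetical helper: converts a cycle count to picoseconds at a given clock.
double cycles_to_picoseconds(double cycles, double clock_ghz) {
    // One cycle at f GHz lasts 1000 / f picoseconds.
    return cycles * (1000.0 / clock_ghz);
}
```

For example, cycles_to_picoseconds(1, 3.0) gives the time for a single-cycle ADD and cycles_to_picoseconds(3, 3.0) the time for a 3-cycle L1 hit. Halving the core clock doubles both numbers, which is the point: the absolute time for the cache probe does not get any shorter.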