<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Surprising Effects of Volatile Qualifier</title>
	<atom:link href="http://mark.santaniello.com/archives/344/feed" rel="self" type="application/rss+xml" />
	<link>http://mark.santaniello.com/archives/344</link>
	<description>the body of a very slow loop</description>
	<pubDate>Thu, 20 Nov 2008 20:35:04 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5.1</generator>
		<item>
		<title>By: Mark</title>
		<link>http://mark.santaniello.com/archives/344#comment-36918</link>
		<dc:creator>Mark</dc:creator>
		<pubDate>Sat, 04 Aug 2007 07:17:01 +0000</pubDate>
		<guid isPermaLink="false">http://mark.santaniello.net/archives/344#comment-36918</guid>
		<description>Gosh, where do I begin?

A register and a cache are different.  

You have to ask a cache, "Do you currently have the value of memory location X?"  This is called a probe and it takes time.  The answer can come back "no" (miss) or "yes" (hit).  Modern CPUs require something on the order of 3 cycles to access their first-level cache.

There are no register file "hits" or "misses".  The code asks for register #10, you go get it.  That's it.  It's always there.  Register files can be read/written multiple times in a single clock cycle.  A register-to-register move, for example, takes exactly 1 cycle.

It's not an accident that today's CPUs have synchronous L1 caches.  It's not because hardware designers didn't ever think, "Hey, what if we made the cache run twice as fast?"  

We make everything run as fast as it possibly can.  This is measured in picoseconds, which is an absolute measurement.  Clock cycles are relative.  At 3 GHz a clock is roughly 330 picoseconds.  So on my hardware, an ADD takes 330 picoseconds.  An L1 cache hit takes 990.  

You can slow your execution units down if you want.  You can perform the ADD in 2790 picoseconds, and then claim your cache runs "twice as fast."   I won't be impressed until you can do a cache probe in less than 990 picoseconds.

--Mark

PS: The "to-memory" operations above are due to the keyword "volatile" from the C language.  http://en.wikipedia.org/wiki/Volatile_variable</description>
		<content:encoded><![CDATA[<p>Gosh, where do I begin?</p>
<p>A register and a cache are different.  </p>
<p>You have to ask a cache, &#8220;Do you currently have the value of memory location X?&#8221;  This is called a probe and it takes time.  The answer can come back &#8220;no&#8221; (miss) or &#8220;yes&#8221; (hit).  Modern CPUs require something on the order of 3 cycles to access their first-level cache.</p>
<p>There are no register file &#8220;hits&#8221; or &#8220;misses&#8221;.  The code asks for register #10, you go get it.  That&#8217;s it.  It&#8217;s always there.  Register files can be read/written multiple times in a single clock cycle.  A register-to-register move, for example, takes exactly 1 cycle.</p>
<p>It&#8217;s not an accident that today&#8217;s CPUs have synchronous L1 caches.  It&#8217;s not because hardware designers didn&#8217;t ever think, &#8220;Hey, what if we made the cache run twice as fast?&#8221;  </p>
<p>We make everything run as fast as it possibly can.  This is measured in picoseconds, which is an absolute measurement.  Clock cycles are relative.  At 3 GHz a clock is roughly 330 picoseconds.  So on my hardware, an ADD takes 330 picoseconds.  An L1 cache hit takes 990.  </p>
<p>You can slow your execution units down if you want.  You can perform the ADD in 2790 picoseconds, and then claim your cache runs &#8220;twice as fast.&#8221;   I won&#8217;t be impressed until you can do a cache probe in less than 990 picoseconds.</p>
<p>&#8211;Mark</p>
<p>PS: The &#8220;to-memory&#8221; operations above are due to the keyword &#8220;volatile&#8221; from the C language.  <a href="http://en.wikipedia.org/wiki/Volatile_variable" rel="nofollow">http://en.wikipedia.org/wiki/Volatile_variable</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Samuel A. Falvo II</title>
		<link>http://mark.santaniello.com/archives/344#comment-36908</link>
		<dc:creator>Samuel A. Falvo II</dc:creator>
		<pubDate>Fri, 03 Aug 2007 21:31:36 +0000</pubDate>
		<guid isPermaLink="false">http://mark.santaniello.net/archives/344#comment-36908</guid>
		<description>Although I am a huge fan of the RISC concept, registers really are overrated.  Since a register is just an explicitly named level-0 data cache, it follows that effective addresses into memory should be just as fast as a register hit (ignoring logic propagation delays, obviously -- I'm not THAT dense!).  It only takes a 2-read-1-write data cache interface, which can be effectively emulated by a cache driven by a CPU clock twice as fast as the execution core's pipeline.

Yes, it is more transistors than a normal register array.  But, then, the Intel philosophy is to just throw more transistors at things anyway.  For all we know, Intel and AMD processors are already doing precisely this, which is likely why your compilers elected to perform "to-memory" operations instead of "to-register" operations.</description>
		<content:encoded><![CDATA[<p>Although I am a huge fan of the RISC concept, registers really are overrated.  Since a register is just an explicitly named level-0 data cache, it follows that effective addresses into memory should be just as fast as a register hit (ignoring logic propagation delays, obviously &#8212; I&#8217;m not THAT dense!).  It only takes a 2-read-1-write data cache interface, which can be effectively emulated by a cache driven by a CPU clock twice as fast as the execution core&#8217;s pipeline.</p>
<p>Yes, it is more transistors than a normal register array.  But, then, the Intel philosophy is to just throw more transistors at things anyway.  For all we know, Intel and AMD processors are already doing precisely this, which is likely why your compilers elected to perform &#8220;to-memory&#8221; operations instead of &#8220;to-register&#8221; operations.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
