Monthly Archive for March, 2007

Debunking the 10x Productivity Myth

You can’t work in this industry for very long before being confronted with the “fact” that super-productive programmers are an order of magnitude faster than you.

(Seriously, how many geeks hear this statistic and rejoice? Finally someone has quantified my amazing ability! No. Nobody does that. Everyone despairs.)

There is good news, however. Upon hearing the 10x number again recently, I started Googling and discovered that, lo and behold, it’s bullshit! The original Sackman study, despite being cited by Fred Brooks and Steve McConnell, is actually a deeply flawed assessment of productivity. Subsequent research indicates that the best performers are merely a binary order of magnitude (just 2x) better.

For more info, check out this great series of articles over at Best Webfoot Forward.

PS: I’m curious about something that I haven’t seen a study directly address so far: what is the typical productivity difference between a person’s best performance and that same individual’s worst performance?

Chris Hecker is Wrong About OoO Execution

Here is a quote from Chris Hecker at GDC 2005:

Modern CPUs use out-of-order execution, which is there to make crappy code run fast. This was really good for the industry when it happened, although it annoyed many assembly language wizards in Sweden.

I first heard this when Chris was quoted by Pete Isensee (from the Xbox 360 team) in his NWCPP talk a year ago. Maybe Chris was kidding. I don’t know. What I do know is:

  1. He is wrong
  2. Smart people are believing him
  3. It’s time to set the record straight

Processors implement dynamic scheduling because sometimes the ideal order for a given sequence of instructions can only be known at runtime. In fact, the ideal order can change each time the instructions are executed.

Imagine your binary contains the following very simple code:


     mov rax, [foo]
     mov rbx, [bar]

Two loads, that’s all. Let’s assume that each of the loads misses cache 10% of the time. Often, one will miss but the other will hit. If you have an in-order machine and the first load misses, you are forced to wait: you cannot proceed to the 2nd load, and you cannot hide any of the miss latency.

No matter how much of an uber assembly coder you are, you are going to be forced to choose an order for these two loads. More likely, your compiler will make this choice for you. Either way, that choice will be wrong at least some of the time.

An OoO processor can do the right thing every time.

Fast SSE Select Operation

Some SIMD architectures have a select instruction which combines two vector registers based on a third vector mask. I lust for such an instruction, but silly little two-operand SSE cannot currently support it.

Of course, people build it out of more primitive ops. For example, here’s what Apple suggests when porting to SSE from Altivec (which supports select):

(Note: this has been translated into Microsoft syntax)

__m128
_mm_sel_ps_apple(const __m128& a, const __m128& b, const __m128& mask)
{
    // (b & mask) | (a & ~mask)
    return _mm_or_ps( _mm_and_ps( b, mask ), _mm_andnot_ps( mask, a ) );
}

We mask b, inverse-mask a, and slam the result together with an or. This is the easy and obvious solution — and the one I used for years.

Here’s a better way:

__m128
_mm_sel_ps_xor(const __m128& a, const __m128& b, const __m128& mask)
{
    // (((b ^ a) & mask)^a)
    return _mm_xor_ps( a, _mm_and_ps( mask, _mm_xor_ps( b, a ) ) );
}

Witness the amazing power of xor! Here we calculate the bitwise difference between a and b. Then we mask and selectively apply this difference to convert some of the a’s into b’s. This is freaking genius. (I wish I could claim to have invented it.)

Here’s the assembly that my compiler generates for the two sequences. First, the naive way:

 movaps      xmm2, XMMWORD PTR [r8] ; mask
 movaps      xmm1, XMMWORD PTR [rdx] ; b
 andps       xmm1, xmm2
 movaps      xmm0, xmm2
 andnps      xmm0, XMMWORD PTR [rcx] ; a
 orps        xmm0, xmm1

Go-go gadget bit-twiddler:

 movaps      xmm1, XMMWORD PTR [rcx] ; a
 movaps      xmm0, XMMWORD PTR [rdx] ; b
 xorps       xmm0, xmm1
 andps       xmm0, XMMWORD PTR [r8] ; mask
 xorps       xmm0, xmm1

It may not be obvious, but the 2nd sequence is much better because all its operations are commutative. Once this little select routine is inlined, a good register allocator will arrange to kill whatever operand may happen to be dead-out. By comparison, the first sequence constrains register allocation with that pesky noncommutative andnps instruction, which has to destroy the mask.

(credit to Jim Conyngham and the MD5 Wikipedia page)

The Case for RISC

I’ve been reading Patterson’s classic RISC manifesto tonight.

I find nearly all of his arguments compelling. I’m so credulous. I’ve got to disagree with something or I’ll lose my CISC-zealot card. Ah, here we are:

Better use of chip area.
If you have the area, why not implement the CISC? For a given chip area there are many tradeoffs for what can be realized. We feel that the area gained back by designing a RISC architecture rather than a CISC architecture can be used to make the RISC even more attractive than the CISC. For example, we feel that the entire system performance might improve more if silicon area were instead used for on-chip caches [Patterson,Séquin80], larger and faster transistors, or even pipelining. As VLSI technology improves, the RISC architecture can always stay one step ahead of the comparable CISC. When the CISC becomes realizable on a single chip, the RISC will have the silicon area to use pipelining techniques; when the CISC gets pipelining the RISC will have on chip caches, etc. The CISC also suffers by the fact that its intrinsic complexity often makes advanced techniques even harder to implement.

I think this one ended up not being a very big deal. Once we started translating the software-visible CISC ops into internal micro-ops, we got most of the advantages of RISC. The decoder is a complex machine, but we solved it once and now it’s just a fixed amount of overhead.

But what the heck do I know — I’m a software guy. I’m just parroting the hardware gurus.

Learn Computer Architecture From Dave Patterson

Book Cover: Hennessy & Patterson: Computer Architecture

Wish you went to Berkeley and learned Comp Arch from the man who wrote the book? I’ve been working for AMD for almost 7 years, and without hesitation I can say that my answer to this question is an emphatic “Yes!” There is always something more to learn.

Well, wish no more. Head over to Berkeley’s CS 252 site and watch the vids.

Update:
Just discovered the slides in PowerPoint and PDF format here

(via Massive Resource List for All Autodidacts)

MacBook WiFi Craps Out Near Antenna

My black MacBook has the strangest WiFi problem. It works great all over my house, except in my office — where the access point itself is! I’ve never heard of a WiFi signal that was too strong.

Anyone else have this problem?




Creative Commons Attribution-NonCommercial 3.0 United States