Archive for the 'Programming' CategoryPage 2 of 3

Fast SSE Select Operation

Some SIMD architectures have a select instruction which combines two vector registers based on a third vector mask. I lust for such an instruction, but silly little two-operand SSE cannot currently support it.

Of course, people build it out of more primitive ops. For example, here’s what Apple suggests when porting to SSE from Altivec (which supports select):

(Note: this has been translated into Microsoft syntax)

__m128
_mm_sel_ps_apple(const __m128& a, const __m128& b, const __m128& mask)
{
    // (b & mask) | (a & ~mask)
    return _mm_or_ps( _mm_and_ps( b, mask ), _mm_andnot_ps( mask, a ) );
}

We mask b, inverse-mask a, and slam the result together with an or. This is the easy and obvious solution — and the one I used for years.

Here’s a better way:

__m128
_mm_sel_ps_xor(const __m128& a, const __m128& b, const __m128& mask)
{
    // (((b ^ a) & mask)^a)
    return _mm_xor_ps( a, _mm_and_ps( mask, _mm_xor_ps( b, a ) ) );
}

Witness the amazing power of xor! Here we calculate the bitwise difference between a and b. Then we mask and selectively apply this difference to convert some of the a’s into b’s. This is freaking genius. (I wish I could claim to have invented it.)

Here’s the assembly that my compiler generates for the two sequences. First, the naive way:

 movaps      xmm2, XMMWORD PTR [r8] ; mask
 movaps      xmm1, XMMWORD PTR [rdx] ; b
 andps       xmm1, xmm2
 movaps      xmm0, xmm2
 andnps      xmm0, XMMWORD PTR [rcx] ; a
 orps        xmm0, xmm1

Go-go gadget bit-twiddler:

 movaps      xmm1, XMMWORD PTR [rcx] ; a
 movaps      xmm0, XMMWORD PTR [rdx] ; b
 xorps	      xmm0, xmm1
 andps	      xmm0, XMMWORD PTR [r8] ; mask
 xorps	      xmm0, xmm1

It may not be obvious, but the 2nd sequence is much better because all it’s operations are commutative. Once this little select routine is inlined, a good register allocator will arrange to kill whatever operand may happen to be dead-out. By comparison, the first sequence constrains register allocation with that pesky noncommutative andnps instruction — which has to destroy the mask.

(credit to Jim Conyngham and the MD5 Wikipedia page)

News Flash: Tech Interviews Don’t Work

There’s been a lot of buzz around the blogosphere lately about various issues with interviewing programmers.

One of these days I’d like to do a study. I’d find a very brave employer who needs to hire 100 people. We would use traditional interview techniques to fill 50 positions. For the other half, we’d pluck reasonable resumes from a pile, choose 50 at random, and just hire them.

After a year, I would return and talk to management. My guess is there would be no significant difference in the performance of the two groups.

If you are anything like me, you’ve been on both sides of the “hiring table” enough times to know that my hypothesis is probably right. There are just too many ways that an interview can fail.

If you don’t ask any tech questions at all, you can end up hiring someone who literally cannot code. Most employers are terrified of these “false positives” and so, instead, they make their interviews highly technical. They only hire people who can remember, on the day of the interview, how to solve the dining philosophers problem, or how heapsort works. However, nerd brains have aggressive garbage collectors which quickly expunge unused information. In English: they forget stuff. This leads to a high “false negative” rate — which is a more insidious problem because it tends to inhibit diversity.

A team with 5 heapsort experts is likely to suck for the same reason that a rock band with 5 drummers would suck. But I digress.

The typical whiteboard question is highly ego-centric. We go into a little room with a question that we’ve asked a dozen times and understand fully. We pose this question to a “lesser” nerd, in a high pressure environment, and carefully note all the missed semicolons. We chuckle to ourselves, secure in our supreme geekhood.

Ego deflation alert: 100% of these candidates could turn around and immediately ask an “interview question” that we could not answer. Oops. Who’s the idiot now?

By focusing on the whiteboard, interviews typically fail to evaluate candidates on many orthogonal — but critically important — axes. For example, how well does the candidate learn? How well do they teach? Do they ask good questions? Are they able to speak and write clearly?

Almost regardless of what questions you ask, there are a number of things that your interview cannot possibly measure. These are things which require months, not hours, to ascertain: work ethic, initiative, team skills, personality, team compatibility, self-motivation, ability to compromise, creativity, etc.

I think the only way to fix this system is to restructure the entire employment process. The root cause is that it’s currently way too hard to fire someone. If professional programmers were more like professional athletes — hired on a contract basis for a specific duration — things would be much different.

Know Your Roots: Optimizing for the 286 and 386

I just read this old Mike Abrash article (warning, 3.5M PDF) from Byte magazine about optimizing for the 286 and 386. If you don’t know Mike Abrash’s name, you almost certainly know his work.

This is fascinating stuff.

Probably the most amazing thing is how much instruction fetch dominates 286/386 performance. Abrash frequently counts instruction bytes and includes this in his calculations for expected execution speed. After a bit of Googling, this made more sense: these were the days before instruction cache.

(Although it’s still possible, you rarely see fetch-limited code on something like a K8.)

So, if you think x86 is an ugly instruction set, consider this heritage. Variable instruction length actually made a lot of sense back then. It was a form of compression. For the same reason, Intel added complex instructions like BOUNDS and REP MOVSW. These compactly express a whole bunch of work.

I guess some things never change. I still do the same kind of measurements that Abrash did all those years ago. There are small differences — he calls a timer routine, where I can simply execute rdtsc — but the method is the same. I find this remarkable, considering how different the machines are.

(via Osterman’s blog)

Intel Programming Seminar Series

Intel Research Berkley has been host to a whole slew of nerd gods lately, and they have been foolish enough to post the seminar videos. Ha, suckers! Now I will become smarter on your dime!

But seriously, Intel is awesome for doing this.

Past speakers include: Hans Boehm, Herb Sutter, Greg Morrisett, Guido van Rossum, Guy Steele, Martin Odersky, and Alan Kay. You’d better get to work watching all of those, because still to come are: Martin Rinard, Bertrand Meyer, Bjarne Stroustrup, Charles Leiserson, and Philip Wadler.

Update: The Alan Kay talk is tremendously good. Go watch it. I mean now, really. I’ll wait.

MIT Intro Algorithms Course

MIT OpenCourseWare LogoMIT’s OpenCourseWare initiative is pretty cool. I just discovered the MIT 6.046J: Introduction to Algorithms home page. Here you’ll find notes (PDF), audio (MP3) and video (RealMedia) for the lectures. It’s been years since I learned this stuff, but it’s never too late for a refresher course.

SE-Radio.net Interviews Guy Steele

Guy talks mostly about Fortress, which I’ve previously blogged about. The interview was conducted at JAOO 2006 in Denmark.

For more, including a nice podcast, head over to SE-radio.net.

The Upshot of Ugly Syntax

I just realized that one of the reasons I choose Perl for “quick hacks” is because the syntax is so ugly and inconsistent.

It’s a psychological thing. If I were to select Ruby, for example, I’d feel a tremendous obligation to write a beautiful well-designed program. I’d probably create a subversion repository, create unit tests, automate, etc. These are all important things, but I want to amortize their up front cost over the term of a long project.

I choose Perl because it takes the pressure off.

Branch-Free Absolute Value Calculation

Microsoft’s Visual C compiler generates the following branch-free code to calculate the absolute value of an integer.


   cdq
   xor     eax, edx
   sub     eax, edx

The cdq instruction sign-extends eax into the register pair edx:eax. The cute trick here is to abuse that behavior to get an all-ones register (edx) in the case of negative input. This can be used to compute the 1’s complement, and we can also subtract it to add one (for the 2’s complement). In the case of a positive input, edx is zero and thus the xor and sub are harmless.

This is faster than the naive version, which would likely contain an unpredictable branch.

I’m surely not clever enough to have written abs this way. It seems like something the Superoptimizer might have conjured up.

Writings of E. W. Dijkstra

The University of Texas hosts an archive of the writings of the late Professor Dijkstra. This is a treasure trove of musings by one of the greatest minds in all of Computer Science. You could do much worse with your time than to spend it poking around the archive. Today, for example, I read this gem:

Consider the plane figure Q, defined as the 8 by 8 square from which, at two opposite corners, two 1 by 1 squares have been removed. The area of Q is 62, which equals the combined area of 31 dominos of 1 by 2. The theorem is that the figure Q cannot be covered by 31 of such dominos.

Another way of stating the theorem is that if you start with squared paper and begin covering this by placing each next domino on two new adjacent squares, no placement of 31 dominos will yield the figure Q.

So, a possible way of proving the theorem is by generating all possible placements of dominos and verifying for each placement that it does not yield the figure Q: a tremendously laborious job.

The simple argument, however is as follows. Colour the squares of the squared paper as on a chess board. Each domino, covering two adjacent squares, covers 1 white and 1 black square, and, hence, each placement covers as many white squares as it covers black squares. In the figure Q, however, the number of white squares and the number of black squares differ by 2 –opposite corners lying on the same diagonal– and hence no placement of dominos yields figure Q.

(Credit to Bheeshmar for leading me to EWB1036)

Off the Grid

If I ever got the chance to use Subversion on a large project, I think I’d branch my ass off. Here’s why:

Act 1, The Vacation
Imagine you’re 30% done with some changes to your project. You’ve done some good, but incomplete, work. Perhaps the code doesn’t even compile in it’s current state. Now you need to go on vacation. You can’t check in, because it will break the rest of the system. But unless you do something to backup your work, you risk losing it all in a hard disk failure.

Act 2, Two Roads Diverged
Back from vacation, and you now find yourself 50% done and facing a tough decision. You can think of two different ways to continue your implementation, and you’re not yet sure which is best. Again, you can’t check your half-done work into source control without incurring the wrath of the team.

The state of the art, at least in the places I’ve worked, is to solve both of these problems by going “off the grid”, and copying zip files around. Essentially this forgoes all the benefits of source control for the duration of the task.

The real solution to this problem is a the concept of a work-in-progress branch.

If each developer branches from the mainline before starting work, she can check in broken code at will, because nobody else cares about her branch. She can checkpoint her work, attempt something, and go back if she changes her mind. She can even branch her branch, if she finds herself truly indecisive. Additionally, she gets to put her important source code on a system with regular backups. Eventually, the WIP branch is merged into the trunk and left to wither and die.

Am I crazy, or does that just feel right?

Some of you may have a legitimate excuse for the off-grid behavior, like a poor version control system. None of this is really feasible unless you have extremely cheap branching, like Subversion’s slick copy-on-write system.

I suspect there are plenty of folks out there who are already using WIP branches (because I’m notoriously late to the party). I’d love to hear your comments.




Creative Commons Attribution-NonCommercial 3.0 United States
Creative Commons Attribution-NonCommercial 3.0 United States