Monthly Archive for April, 2008Page 2 of 2

Revisiting Fast SSE Select

I started thinking about this SSE select operation again, probably round about the time I learned of AMD’s SSEPlus and its logical_bitwise_choose function.

(Awesome library, terrible function name. But I digress…)

There are two alternatives at hand. One is the classic, obvious method:

  ; ((a & MASK) | (b & ~MASK))
  ; xmm0 = MASK
  ; xmm1 = a
  ; xmm2 = b
  andps  xmm1, xmm0
  andnps xmm0, xmm2
  orps   xmm0, xmm1

The other method uses fancy bit-twiddling:

  ; (((a ^ b) & MASK) ^ b)
  ; xmm0 = b
  ; xmm1 = a
  ; xmm2 = MASK
  xorps xmm0, xmm1
  andps xmm0, xmm2
  xorps xmm0, xmm1

I’ve long regarded the 2nd sequence as better, but I’m no longer sure. Take a look at the DAGs for the two sequences:

Obvious Method
Xor Method

The xor method avoids destroying the mask, but requires three serially-dependent instructions. The “obvious” method provides parallelism, but destroys the mask.

This is an example of a classic compiler phase-ordering problem. Optimal instruction selection depends on knowledge of the “liveness” of the mask variable.

Rotate CMD.exe Colors

Call this function to rotate through different CMD.exe color schemes on Windows. Modify the COLOR_LIST variable as you like (see COLOR /? for details).

:SetColor
if NOT DEFINED COLOR_LIST set COLOR_LIST=0A0B0E
setlocal

set CAR=%COLOR_LIST:~0,2%
set CDR=%COLOR_LIST:~-4%

color %CAR%

REM rotate w/ export hack
for %%D IN (1) DO (
   endlocal
   set COLOR_LIST=%CDR%%CAR%
)

exit /B

Update:
It might be desirable to change the line set COLOR_LIST=%CDR%%CAR% to instead use the setx command like this:

setx COLOR_LIST %CDR%%CAR%

This sets a global environment variable instead of a local one which merely applies to the current CMD window and descendants.

No, Donald Knuth, Half the TLB is Not “Wasted”

<disclaimer>
Donald Knuth is a genius. If he suffered significant head trauma, was sleep-deprived for days, and then pounded 10 shots of Jager in 10 minutes while holding his breath, his pinky toe could still beat me at Scrabble. I mean, the man is so smart that I am legally retarded by comparison. I should be arrested for even speaking his name.
</disclaimer>

Donald Knuth apparently gave a Valentine’s Day lecture, and at least one fella came away with this in his notebook:

He talked about how most userland programs were still 32-bit and how operating systems and processors had moved on to 64-bits and made the point that in this scenario, half of the bits in the TLB were being wasted.

Hmm, no. That just sounds wrong.

A TLB is this pretty wicked gizmo that performs address translations literally in one nanosecond via the magic of caching and SRAM. I’m not a hardware guy, but I know that this SRAM crap doesn’t come cheap. Would anyone ever waste it? I think not.

AMD64 chips employ a simple canonical address rule which limits the effective virtual address space to 48-bits while allowing for future expansion. It seems that current implementations map these virtual addresses into a 52-bit physical address space.

It’s pretty clear that we didn’t invent this idea. The MMU on the DEC Alpha, for example, left the upper 21 address bits unused.

So, sure, the page tables themselves hold 64-bits, but the hardware need only map subsets — and I’ll bet you lunch that the TLB’s SRAM structures are sized for this requirement.

More Stupid x86 Assembly Tricks

A couple of weeks ago I went hunting for a better way to compute x!=0 on x86. Eventually, I came up with a cute carry-flag trick and blogged about it.

(Note: I’m not branching on this comparison — that would be easy. Instead I want the value of the comparison in a general-purpose register. I should have made this explicitly clear in my original post. Alas, I did not. Doh.)

My goal was to avoid using setcc, because partial-register writes are the devil.

Try as I might, I couldn’t imagine a way generalize my solution so that it would also work for x==0. Someone suggested that I try the GNU superoptimizer (PDF, code), so I did.

At first I was a bit disappointed that the superoptimizer didn’t discover my sequence for x!=0. I think, maybe, the cost heuristics are outdated. (It should model xor reg,reg as being really cheap†.)

Turns out that the superoptimizer is still really clever anyway. It was a source of some great ideas. I’m delighted with what “we” came up with for x==0:

Old method; naive and literal:

85 c9           test     ecx, ecx
0f 94 c0        sete     al
0f b6 c0        movzx    eax, al

New method:

31 c0           xor      eax, eax
83 f9 01        cmp      ecx, 1
11 c0           adc      eax, eax

Once again the new method avoids the setcc and thus avoids insert semantics. As a nice bonus, we save a byte of code.

†This doesn’t actually depend on the input register at all. It’s essentially a “load-zero” instruction. Modern processors understand this and schedule accordingly.




Creative Commons Attribution-NonCommercial 3.0 United States
Creative Commons Attribution-NonCommercial 3.0 United States