I started thinking about this SSE select operation again, probably round about the time I learned of AMD’s SSEPlus and its logical_bitwise_choose function.
(Awesome library, terrible function name. But I digress…)
There are two alternatives at hand. One is the classic, obvious method:
; ((a & MASK) | (b & ~MASK))
; xmm0 = MASK
; xmm1 = a
; xmm2 = b
andps xmm1, xmm0
andnps xmm0, xmm2
orps xmm0, xmm1
The other method uses fancy bit-twiddling:
; (((a ^ b) & MASK) ^ b)
; xmm0 = a
; xmm1 = b
; xmm2 = MASK
xorps xmm0, xmm1
andps xmm0, xmm2
xorps xmm0, xmm1
I’ve long regarded the second sequence as better, but I’m no longer sure. Take a look at the DAGs for the two sequences:
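The xor method avoids destroying the mask, but requires three serially-dependent instructions. The “obvious” method provides parallelism, since its andps and andnps are independent of each other and only the orps has to wait for both, but it destroys the mask.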
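This is an example of a classic compiler phase-ordering problem. Optimal instruction selection depends on knowledge of the “liveness” of the mask variable: if the mask is needed again afterwards, the “obvious” method must first copy it to a spare register.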
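
As a sanity check on the two formulas, here is a minimal sketch in plain C that models a single 32-bit lane with ordinary integer operations. It is not SSE code, and the function names select_and_or and select_xor are made up for illustration. The temporaries mirror the dependency structure of the two assembly sequences.

```c
#include <stdint.h>

/* Classic form: (a & mask) | (b & ~mask).
 * from_a and from_b do not depend on each other, so the two ANDs
 * can be computed in parallel; only the final OR waits on both. */
static uint32_t select_and_or(uint32_t a, uint32_t b, uint32_t mask)
{
    uint32_t from_a = a & mask;   /* bits taken from a where mask is 1 */
    uint32_t from_b = b & ~mask;  /* bits taken from b where mask is 0 */
    return from_a | from_b;
}

/* XOR form: ((a ^ b) & mask) ^ b.
 * Each step consumes the result of the previous one, so the chain
 * is strictly serial, but mask is only ever read. */
static uint32_t select_xor(uint32_t a, uint32_t b, uint32_t mask)
{
    uint32_t diff = a ^ b;        /* bits where a and b differ */
    uint32_t kept = diff & mask;  /* keep differences only where mask is 1 */
    return kept ^ b;              /* flip those bits of b, yielding a there */
}
```

For example, with a = 0xAAAAAAAA, b = 0x55555555 and mask = 0xFFFF0000, both functions return 0xAAAA5555. Note that the final XOR has to be against b: XORing against a instead would swap which operand is selected.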
