I started thinking about this SSE select operation again, probably round about the time I learned of AMD’s SSEPlus and its logical_bitwise_choose function.
(Awesome library, terrible function name. But I digress…)
There are two alternatives at hand. One is the classic, obvious method:
; ((a & MASK) | (b & ~MASK))
; xmm0 = MASK
; xmm1 = a
; xmm2 = b
andps xmm1, xmm0
andnps xmm0, xmm2
orps xmm0, xmm1
The other method uses fancy bit-twiddling:
; (((a ^ b) & MASK) ^ b)
; xmm0 = b
; xmm1 = a
; xmm2 = MASK
xorps xmm0, xmm1
andps xmm0, xmm2
xorps xmm0, xmm1
I’ve long regarded the 2nd sequence as better, but I’m no longer sure. Take a look at the DAGs for the two sequences:
The xor method avoids destroying the mask, but requires three serially-dependent instructions. The “obvious” method provides parallelism, but destroys the mask.
This is an example of a classic compiler phase-ordering problem. Optimal instruction selection depends on knowledge of the “liveness” of the mask variable.

I think your arrows are backward…
You are probably imagining that they indicate data flow, but I read them as “depends on.” Sorry if this is non-standard.
Ahh that makes more sense.
I have updated the function name. It is now logical_bitwise_select().
What do you think it should be called?
I think that’s a better name, although I’m not sure “logical” adds much. What’s the alternative? An arithmetic select? I might just call it bitwise_select.