Some SIMD architectures have a select instruction which combines two vector registers based on a third vector mask. I lust for such an instruction, but silly little two-operand SSE cannot currently support it.
Of course, people build it out of more primitive ops. For example, here’s what Apple suggests when porting to SSE from Altivec (which supports select):
(Note: this has been translated into Microsoft syntax)
__m128
_mm_sel_ps_apple(const __m128& a, const __m128& b, const __m128& mask)
{
    // (b & mask) | (a & ~mask)
    return _mm_or_ps( _mm_and_ps( b, mask ), _mm_andnot_ps( mask, a ) );
}
We mask b, inverse-mask a, and slam the result together with an or. This is the easy and obvious solution — and the one I used for years.
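In case it isn't obvious where the mask comes from: it is usually the output of an SSE compare, which writes all ones into each lane that passes and all zeros into the rest. Here is a minimal sketch in plain C (so the arguments are passed by value instead of by reference). The names sel_ps, clamp_negative_to_zero, and check_clamp are made up for illustration and are not from Apple's porting guide.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Same mask-and-or select as above: (b & mask) | (a & ~mask). */
static __m128 sel_ps(__m128 a, __m128 b, __m128 mask)
{
    return _mm_or_ps(_mm_and_ps(b, mask), _mm_andnot_ps(mask, a));
}

/* Hypothetical helper: replace every negative lane with zero. */
static __m128 clamp_negative_to_zero(__m128 x)
{
    __m128 zero = _mm_setzero_ps();
    __m128 mask = _mm_cmplt_ps(x, zero);  /* all ones where x < 0 */
    return sel_ps(x, zero, mask);         /* zero where mask is set, else x */
}

/* Small usage check on one vector of four floats. */
static void check_clamp(void)
{
    float out[4];
    _mm_storeu_ps(out, clamp_negative_to_zero(_mm_setr_ps(-1.0f, 2.0f, -3.5f, 0.0f)));
    assert(out[0] == 0.0f && out[1] == 2.0f);
    assert(out[2] == 0.0f && out[3] == 0.0f);
}
```

The select itself never looks at the values in a and b. It only routes bits, so the mask has to be all ones or all zeros per lane for the result to be a clean per-lane pick.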
Here’s a better way:
__m128
_mm_sel_ps_xor(const __m128& a, const __m128& b, const __m128& mask)
{
    // (((b ^ a) & mask)^a)
    return _mm_xor_ps( a, _mm_and_ps( mask, _mm_xor_ps( b, a ) ) );
}
Witness the amazing power of xor! Here we calculate the bitwise difference between a and b. Then we mask and selectively apply this difference to convert some of the a’s into b’s. This is freaking genius. (I wish I could claim to have invented it.)
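If you want to convince yourself, the identity is easy to check one bit at a time. Where a mask bit is 1 the result is a ^ (b ^ a), which is b. Where it is 0 the result is a ^ 0, which is a. Below is a small scalar sketch on 32-bit integers; the SSE versions do the same thing across 128 bits. The names sel_or, sel_xor, and check_sel_forms_agree are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the mask-and-or select: (b & mask) | (a & ~mask). */
static uint32_t sel_or(uint32_t a, uint32_t b, uint32_t mask)
{
    return (b & mask) | (a & ~mask);
}

/* Scalar model of the xor select: ((b ^ a) & mask) ^ a. */
static uint32_t sel_xor(uint32_t a, uint32_t b, uint32_t mask)
{
    return a ^ (mask & (b ^ a));
}

/* Asserts that both forms agree on every combination of a few bit patterns. */
static void check_sel_forms_agree(void)
{
    static const uint32_t v[] = {
        0x00000000u, 0xFFFFFFFFu, 0x12345678u, 0xDEADBEEFu, 0x0F0F0F0Fu
    };
    const unsigned n = sizeof v / sizeof v[0];
    for (unsigned i = 0; i < n; ++i)
        for (unsigned j = 0; j < n; ++j)
            for (unsigned k = 0; k < n; ++k)
                assert(sel_or(v[i], v[j], v[k]) == sel_xor(v[i], v[j], v[k]));
}
```

For example, sel_xor(0xAAAAAAAA, 0x55555555, 0xFFFF0000) takes the high half from b and the low half from a, giving 0x5555AAAA.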
Here’s the assembly that my compiler generates for the two sequences. First, the naive way:
movaps xmm2, XMMWORD PTR [r8] ; mask
movaps xmm1, XMMWORD PTR [rdx] ; b
andps xmm1, xmm2
movaps xmm0, xmm2
andnps xmm0, XMMWORD PTR [rcx] ; a
orps xmm0, xmm1
Go-go gadget bit-twiddler:
movaps xmm1, XMMWORD PTR [rcx] ; a
movaps xmm0, XMMWORD PTR [rdx] ; b
xorps xmm0, xmm1
andps xmm0, XMMWORD PTR [r8] ; mask
xorps xmm0, xmm1
It may not be obvious, but the second sequence is much better because all its operations are commutative. Once this little select routine is inlined, a good register allocator will arrange to kill whatever operand may happen to be dead-out. By comparison, the first sequence constrains register allocation with that pesky noncommutative andnps instruction. It inverts its destination operand, so the mask has to sit in the register that gets overwritten, which is why the listing above spends an extra movaps copying the mask before destroying it.
(credit to Jim Conyngham and the MD5 Wikipedia page)