Description
Summary
Intel has changed it's documentation of it's bsf
instruction implementation to match AMD's. Intel's bsf
is now documented to leave the destination register unchanged if the input is zero. This allows a significant performance optimization to llvm.cttz.{i16,i32,i64}
that LLVM currently doesn't do. (See "Legality of the optimization" for further caveats.)
Background
llvm.cttz.*
relies on the bsf
instruction on x86 if BMI1 isn't known to be available. It uses rep bsf
so that CPUs that do support tzcnt
execute it as tzcnt
(as the representation is the same), however just in case the target executes it as bsf
, LLVM generates a branch to handle the case where the input is zero, as unlike tzcnt
, bsf
has different behavior for the case where the input is zero.
What LLVM currently does
Here's the Godbolt link demonstrating what follows. It's in Rust; hope that's okay. (Note that rustc nightly
uses a very up-to-date version of LLVM if available. It's using 19.1.5 on my system.)
It tests if the input is zero, then branches.
tzcnt_u64:
test rdi, rdi
je .LBB2_1
rep bsf rax, rdi
ret
.LBB2_1:
mov eax, 64
ret
(Similarly for i32, but for i16 LLVM uses a different trick to improve llvm.cttz.i16
at the moment. That trick could be used for 32-bit integers on 64-bit platforms too, I think? Not sure. You can see it occur in the generated assembly for tzcnt_u16
in the Godbolt link above.)
The optimized version
If bsf
leaves the desination register unchanged if the input is zero, then the following implementation of llvm.cttz.i64
is valid:
tzcnt_u64:
mov eax, 64
rep bsf rax, rdi
ret
This is for 64-bit integers. The same can be applied for 32-bit integers and 16-bit integers.
Legality of the optimization
Of course, if rep bsf
is run as tzcnt
then tzcnt
will write the operand size into the destination register making the mov
redundant.
The necessary behavior is that bsf
/rep bsf
leaves the destination register unchanged.
On x86, it's generally assumed that the destination of the bsf
instruction is undefined when the input is zero. There are quite a few manufacturers of x86 chips, so it's tougher to make the case that the above holds. So I'm solely focusing on x86_64.
AMD64 implements the desired behavior: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf
If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register.
Intel used to document the destination register's contents as undefined (but implemented it in the same way as AMD), but it has decided to commit to the same behavior relatively recently: https://cdrdv2.intel.com/v1/dl/getContent/671110
If the content of the source operand is zero, the destination operand is unmodified. [Footnote: On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.]
The detail in the footnote does not prevent the optimization (the value 32 will not be overwritten) but I'm including it for completeness.
The only other implementation of x86_64 that I'm aware of is the VIA's Nanos (2008-2015). I can't find an instruction set reference for them. I don't know whether they're relevant, or whether they can be safety excluded as possibilities from certain x86_64 target triples. This is the crux of the matter, I believe. I don't know if it's reasonable for LLVM to bother trying to cater to an ISA that requires signing an NDA to get manuals for according to https://forum.osdev.org/viewtopic.php?t=26351 and MIGHT have an implementation that doesn't conform to AMD64 or Intel64 - who knows what they're doing - they probably maintain compilers that work for their implementation anyway.