Skip to content

Suboptimal codegen for llvm.cttz on x86_64 (no BMI1) #122004

Closed
@SFBdragon

Description

@SFBdragon

Summary

Intel has changed it's documentation of it's bsf instruction implementation to match AMD's. Intel's bsf is now documented to leave the destination register unchanged if the input is zero. This allows a significant performance optimization to llvm.cttz.{i16,i32,i64} that LLVM currently doesn't do. (See "Legality of the optimization" for further caveats.)

Background

llvm.cttz.* relies on the bsf instruction on x86 if BMI1 isn't known to be available. It uses rep bsf so that CPUs that do support tzcnt execute it as tzcnt (as the representation is the same), however just in case the target executes it as bsf, LLVM generates a branch to handle the case where the input is zero, as unlike tzcnt, bsf has different behavior for the case where the input is zero.

What LLVM currently does

Here's the Godbolt link demonstrating what follows. It's in Rust; hope that's okay. (Note that rustc nightly uses a very up-to-date version of LLVM if available. It's using 19.1.5 on my system.)

It tests if the input is zero, then branches.

tzcnt_u64:
        test    rdi, rdi
        je      .LBB2_1
        rep       bsf rax, rdi
        ret
.LBB2_1:
        mov     eax, 64
        ret

(Similarly for i32, but for i16 LLVM uses a different trick to improve llvm.cttz.i16 at the moment. That trick could be used for 32-bit integers on 64-bit platforms too, I think? Not sure. You can see it occur in the generated assembly for tzcnt_u16 in the Godbolt link above.)

The optimized version

If bsf leaves the desination register unchanged if the input is zero, then the following implementation of llvm.cttz.i64 is valid:

tzcnt_u64:
        mov     eax, 64
        rep       bsf rax, rdi
        ret

This is for 64-bit integers. The same can be applied for 32-bit integers and 16-bit integers.

Legality of the optimization

Of course, if rep bsf is run as tzcnt then tzcnt will write the operand size into the destination register making the mov redundant.

The necessary behavior is that bsf/rep bsf leaves the destination register unchanged.

On x86, it's generally assumed that the destination of the bsf instruction is undefined when the input is zero. There are quite a few manufacturers of x86 chips, so it's tougher to make the case that the above holds. So I'm solely focusing on x86_64.

AMD64 implements the desired behavior: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf

If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register.

Intel used to document the destination register's contents as undefined (but implemented it in the same way as AMD), but it has decided to commit to the same behavior relatively recently: https://cdrdv2.intel.com/v1/dl/getContent/671110

If the content of the source operand is zero, the destination operand is unmodified. [Footnote: On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.]

The detail in the footnote does not prevent the optimization (the value 32 will not be overwritten) but I'm including it for completeness.

The only other implementation of x86_64 that I'm aware of is the VIA's Nanos (2008-2015). I can't find an instruction set reference for them. I don't know whether they're relevant, or whether they can be safety excluded as possibilities from certain x86_64 target triples. This is the crux of the matter, I believe. I don't know if it's reasonable for LLVM to bother trying to cater to an ISA that requires signing an NDA to get manuals for according to https://forum.osdev.org/viewtopic.php?t=26351 and MIGHT have an implementation that doesn't conform to AMD64 or Intel64 - who knows what they're doing - they probably maintain compilers that work for their implementation anyway.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions