Suboptimal codegen for `llvm.cttz` on x86_64 (no BMI1)

## Summary

Intel has changed it's documentation of it's `bsf` instruction implementation to match AMD's. Intel's `bsf` is now documented to leave the destination register unchanged if the input is zero. This allows a significant performance optimization to  `llvm.cttz.{i16,i32,i64}` that LLVM currently doesn't do. (See "Legality of the optimization" for further caveats.)

## Background

`llvm.cttz.*` relies on the `bsf` instruction on x86 if BMI1 isn't known to be available. It uses `rep bsf` so that CPUs that do support `tzcnt` execute it as `tzcnt` (as the representation is the same), however just in case the target executes it as `bsf`, LLVM generates a branch to handle the case where the input is zero, as unlike `tzcnt`, `bsf` has different behavior for the case where the input is zero.

## What LLVM currently does

[Here's the Godbolt link demonstrating what follows. It's in Rust; hope that's okay.](https://godbolt.org/z/35eov4q5v) (Note that `rustc nightly` uses a very up-to-date version of LLVM if available. It's using 19.1.5 on my system.) 

It tests if the input is zero, then branches.

```asm
tzcnt_u64:
        test    rdi, rdi
        je      .LBB2_1
        rep       bsf rax, rdi
        ret
.LBB2_1:
        mov     eax, 64
        ret
```

(Similarly for i32, but for i16 LLVM uses a different trick to improve `llvm.cttz.i16` at the moment. That trick _could_ be used for 32-bit integers on 64-bit platforms too, I think? Not sure. You can see it occur in the generated assembly for `tzcnt_u16` in the Godbolt link above.)

## The optimized version

If `bsf` leaves the desination register unchanged if the input is zero, then the following implementation of `llvm.cttz.i64` is valid:

```asm
tzcnt_u64:
        mov     eax, 64
        rep       bsf rax, rdi
        ret
```

This is for 64-bit integers. The same can be applied for 32-bit integers and 16-bit integers.

## Legality of the optimization

Of course, if `rep bsf` is run as `tzcnt` then `tzcnt` will write the operand size into the destination register making the `mov` redundant.

The necessary behavior is that `bsf`/`rep bsf` leaves the destination register unchanged.

On x86, it's generally assumed that the destination of the `bsf` instruction is undefined when the input is zero. There are quite a few manufacturers of x86 chips, so it's tougher to make the case that the above holds. So I'm solely focusing on x86_64.

AMD64 implements the desired behavior: https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf

> If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register.

Intel used to document the destination register's contents as undefined (but implemented it in the same way as AMD), but it has decided to commit to the same behavior relatively recently: https://cdrdv2.intel.com/v1/dl/getContent/671110

> If the content of the source operand is zero, the destination operand is unmodified. [Footnote: On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.] 

The detail in the footnote does not prevent the optimization (the value 32 will not be overwritten) but I'm including it for completeness.

The only other implementation of x86_64 that I'm aware of is the VIA's Nanos (2008-2015). I can't find an instruction set reference for them. I don't know whether they're relevant, or whether they can be safety excluded as possibilities from certain x86_64 target triples. **This is the crux of the matter, I believe.** I don't know if it's reasonable for LLVM to bother trying to cater to an ISA that requires signing an NDA to get manuals for according to https://forum.osdev.org/viewtopic.php?t=26351 and MIGHT have an implementation that doesn't conform to AMD64 or Intel64 - who knows what they're doing - they probably maintain compilers that work for their implementation anyway.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Suboptimal codegen for `llvm.cttz` on x86_64 (no BMI1) #122004

Summary

Background

What LLVM currently does

The optimized version

Legality of the optimization

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Suboptimal codegen for llvm.cttz on x86_64 (no BMI1) #122004

Description

Summary

Background

What LLVM currently does

The optimized version

Legality of the optimization

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Suboptimal codegen for `llvm.cttz` on x86_64 (no BMI1) #122004