Skip to content

[x86] Unable to fold table lookup into tail call #136848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
zmodem opened this issue Apr 23, 2025 · 7 comments · May be fixed by #137643
Open

[x86] Unable to fold table lookup into tail call #136848

zmodem opened this issue Apr 23, 2025 · 7 comments · May be fixed by #137643

Comments

@zmodem
Copy link
Collaborator

zmodem commented Apr 23, 2025

Reported by @haberman

Consider:

typedef void Func(void *table, long idx);

void ExtraMov(Func** table, long idx) {
    return table[idx](table, idx);
}

Actual output:

ExtraMov(void (**)(void*, long), long):
        mov     rax, qword ptr [rdi + 8*rsi]
        jmp     rax

Desired output:

ExtraMov(void (**)(void*, long), long):
        jmp     qword ptr [rdi + 8*rsi]

(GCC can do it: https://godbolt.org/z/73rGW9M45)

(For 32-bit x86 we get it right, but only for -fno-pic; for -fpic we get confused: https://godbolt.org/z/1e9s787Gr)

@llvmbot
Copy link
Member

llvmbot commented Apr 23, 2025

@llvm/issue-subscribers-backend-x86

Author: Hans Wennborg (zmodem)

Reported by @haberman

Consider:

typedef void Func(void *table, long idx);

void ExtraMov(Func** table, long idx) {
    return table[idx](table, idx);
}

Actual output:

ExtraMov(void (**)(void*, long), long):
        mov     rax, qword ptr [rdi + 8*rsi]
        jmp     rax

Desired output:

ExtraMov(void (**)(void*, long), long):
        jmp     qword ptr [rdi + 8*rsi]

(GCC can do it: https://godbolt.org/z/73rGW9M45)

(For 32-bit x86 we get it right, but only for -fno-pic; for -fpic we get confused: https://godbolt.org/z/1e9s787Gr)

@zmodem
Copy link
Collaborator Author

zmodem commented Apr 23, 2025

The tail call instruction we want(?) is

def TCRETURNmi64 : PseudoI<(outs),

it takes an i64mem_TC as input:

// Special i64mem for addresses of load folding tail calls. These are not
// allowed to use callee-saved registers since they must be scheduled
// after callee-saved register are popped.
def i64mem_TC : X86MemOperand<"printqwordmem", X86Mem64AsmOperand, 64> {
let MIOperandInfo = (ops ptr_rc_tailcall, i8imm,
ptr_rc_tailcall, i32imm, SEGMENT_REG);
}

But neither RSI nor RDI are callee-preseved registers, so using them should be fine? The register class is ptr_rc_tailcall:

def ptr_rc_tailcall : PointerLikeRegClass<4>;

The comment says that includes GR64_TC but how does that work? Where is PointerLikeRegClass<4> defined?

Here:

case 4: // Available for tailcall (not callee-saved GPRs).
return getGPRsForTailCall(MF);
}
}
const TargetRegisterClass *
X86RegisterInfo::getGPRsForTailCall(const MachineFunction &MF) const {
const Function &F = MF.getFunction();
if (IsWin64 || (F.getCallingConv() == CallingConv::Win64))
return &X86::GR64_TCW64RegClass;
else if (Is64Bit)
return &X86::GR64_TCRegClass;

okay, and GR64_TC looks like this:

def GR64_TC : RegisterClass<"X86", [i64], 64, (add RAX, RCX, RDX, RSI, RDI,
R8, R9, R11, RIP, RSP)>;

Both RSI and RDI are there, so what's the problem?

@zmodem
Copy link
Collaborator Author

zmodem commented Apr 23, 2025

I think this is where we would fold a tail call to TCRETURNmi64:

// Don't fold loads into X86tcret requiring more than 6 regs.
// There wouldn't be enough scratch registers for base+index.
def : Pat<(X86tcret_6regs (load addr:$dst), timm:$off),
(TCRETURNmi64 addr:$dst, timm:$off)>,
Requires<[In64BitMode, NotUseIndirectThunkCalls]>;

That sounds a bit special.

X86tcret_6regs is defined as:

def X86tcret_6regs : PatFrag<(ops node:$ptr, node:$off),
(X86tcret node:$ptr, node:$off), [{
// X86tcret args: (*chain, ptr, imm, regs..., glue)
unsigned NumRegs = 0;
for (unsigned i = 3, e = N->getNumOperands(); i != e; ++i)
if (isa<RegisterSDNode>(N->getOperand(i)) && ++NumRegs > 6)
return false;
return true;
}]>;

We're not hitting that return false though. And even if I change the pattern like:

diff --git a/llvm/lib/Target/X86/X86InstrCompiler.td b/llvm/lib/Target/X86/X86InstrCompiler.td
index 9687ae29f1c7..0ffc83ca01b3 100644
--- a/llvm/lib/Target/X86/X86InstrCompiler.td
+++ b/llvm/lib/Target/X86/X86InstrCompiler.td
@@ -1344,7 +1344,7 @@ def : Pat<(X86tcret ptr_rc_tailcall:$dst, timm:$off),
 
 // Don't fold loads into X86tcret requiring more than 6 regs.
 // There wouldn't be enough scratch registers for base+index.
-def : Pat<(X86tcret_6regs (load addr:$dst), timm:$off),
+def : Pat<(X86tcret (load addr:$dst), timm:$off),
           (TCRETURNmi64 addr:$dst, timm:$off)>,
           Requires<[In64BitMode, NotUseIndirectThunkCalls]>;

it does not do the fold. Maybe I'm looking at the wrong thing.

@topperc
Copy link
Collaborator

topperc commented Apr 23, 2025

The debug log indicates it is failing OPC_CheckFoldableChainNode

@topperc
Copy link
Collaborator

topperc commented Apr 23, 2025

There's this piece of code in X86DAGToDAGISel::PreprocessISelDAG that is supposed to fix up the chain to make the load foldable.

    if (OptLevel != CodeGenOptLevel::None &&                                     
        // Only do this when the target can fold the load into the call or       
        // jmp.                                                                  
        !Subtarget->useIndirectThunkCalls() &&                                   
        ((N->getOpcode() == X86ISD::CALL && !Subtarget->slowTwoMemOps()) ||      
         (N->getOpcode() == X86ISD::TC_RETURN &&                                 
          (Subtarget->is64Bit() ||                                               
           !getTargetMachine().isPositionIndependent())))) {                     
      /// Also try moving call address load from outside callseq_start to just   
      /// before the call to allow it to be folded.                              
      ///                                                                        
      ///     [Load chain]                                                       
      ///         ^                                                              
      ///         |                                                              
      ///       [Load]                                                           
      ///       ^    ^                                                           
      ///       |    |                                                           
      ///      /      \--                                                        
      ///     /          |                                                       
      ///[CALLSEQ_START] |                                                       
      ///     ^          |                                                       
      ///     |          |                                                       
      /// [LOAD/C2Reg]   |                                                       
      ///     |          |                                                       
      ///      \        /                                                        
      ///       \      /                                                         
      ///       [CALL]                                                           
      bool HasCallSeq = N->getOpcode() == X86ISD::CALL;                          
      SDValue Chain = N->getOperand(0);                                          
      SDValue Load  = N->getOperand(1);                                          
      if (!isCalleeLoad(Load, Chain, HasCallSeq))                                
        continue;                                                                
      moveBelowOrigChain(CurDAG, Load, SDValue(N, 0), Chain);                    
      ++NumLoadMoved;                                                            
      MadeChange = true;                                                         
      continue;                                                                  
    } 

The isCalleeLoad is failing because there are two CopyToReg nodes. If I remove one of the arguments from Func the fold does happen.

@zmodem
Copy link
Collaborator Author

zmodem commented Apr 25, 2025

Thanks for the pointer!

It seems possible to coax isCalleeLoad / moveBelowOrigChain to handle the two-argument case also. But it feels a little bit scary that we're moving the load past CopyToReg node(s) which are targeting registers which the load is taking its inputs from..

Attaching the DAGs from -view-isel-dags for convenience.

The one-argument version (where we successfully move the load next to the call and fold it):

typedef void Func(long idx);
Func** table;

void ExtraMov(long idx) {
    return table[idx](idx);
}

Image

The original test case -- two arguments, where we fail to move/fold the load:

Image

@zmodem
Copy link
Collaborator Author

zmodem commented Apr 25, 2025

But it feels a little bit scary that we're moving the load past CopyToReg node(s) which are targeting registers which the load is taking its inputs from..

.. or never mind. Those inputs come from CopyFromReg's which will still be further up the chain: they come before the position that we move the load from, which mean they will still come before the CopyToReg's which come after.

zmodem added a commit to zmodem/llvm-project that referenced this issue Apr 25, 2025
…two args

this folds more stuff, but also finds new breakages

Fixes llvm#136848
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants