[X86] Enable alias analysis (AA) during codegen #123787
@llvm/pr-subscribers-llvm-analysis @llvm/pr-subscribers-backend-x86

Author: Benjamin Maxwell (MacDue)

Changes

This can still be disabled by setting the flag `-x86-use-aa=false`. All tests have been updated to account for this change except:

test/CodeGen/X86/regalloc-advanced-split-cost.ll

where the spill needed for part of the test disappears with codegen AA enabled (so it is left disabled for that test).

Enabling AA during codegen makes X86 consistent with other targets such as AArch64 and RISC-V. This will avoid regressing x86 targets when using the new `llvm.sincos` intrinsic; see: #121763.

Patch is 80.29 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/123787.diff

31 Files Affected:
diff --git a/llvm/lib/Target/X86/X86Subtarget.cpp b/llvm/lib/Target/X86/X86Subtarget.cpp
index b563f6ebce34e9..11327ee69a5546 100644
--- a/llvm/lib/Target/X86/X86Subtarget.cpp
+++ b/llvm/lib/Target/X86/X86Subtarget.cpp
@@ -54,6 +54,8 @@ static cl::opt<bool>
X86EarlyIfConv("x86-early-ifcvt", cl::Hidden,
cl::desc("Enable early if-conversion on X86"));
+static cl::opt<bool> UseAA("x86-use-aa", cl::init(true),
+ cl::desc("Enable the use of AA during codegen."));
/// Classify a blockaddress reference for the current subtarget according to how
/// we should reference it in a non-pcrel context.
@@ -320,6 +322,8 @@ void X86Subtarget::initSubtargetFeatures(StringRef CPU, StringRef TuneCPU,
PreferVectorWidth = 256;
}
+bool X86Subtarget::useAA() const { return UseAA; }
+
X86Subtarget &X86Subtarget::initializeSubtargetDependencies(StringRef CPU,
StringRef TuneCPU,
StringRef FS) {
diff --git a/llvm/lib/Target/X86/X86Subtarget.h b/llvm/lib/Target/X86/X86Subtarget.h
index e3cb9ee8ce1909..e2169c6b8d5e08 100644
--- a/llvm/lib/Target/X86/X86Subtarget.h
+++ b/llvm/lib/Target/X86/X86Subtarget.h
@@ -155,6 +155,8 @@ class X86Subtarget final : public X86GenSubtargetInfo {
const LegalizerInfo *getLegalizerInfo() const override;
const RegisterBankInfo *getRegBankInfo() const override;
+ bool useAA() const override;
+
private:
/// Initialize the full set of dependencies so we can use an initializer
/// list for X86Subtarget.
diff --git a/llvm/test/CodeGen/X86/2007-10-12-SpillerUnfold1.ll b/llvm/test/CodeGen/X86/2007-10-12-SpillerUnfold1.ll
index d77d4352f8336c..8ecf9e7a8fccd9 100644
--- a/llvm/test/CodeGen/X86/2007-10-12-SpillerUnfold1.ll
+++ b/llvm/test/CodeGen/X86/2007-10-12-SpillerUnfold1.ll
@@ -4,32 +4,23 @@
define fastcc void @fht(ptr %fz, i16 signext %n) {
; CHECK-LABEL: fht:
; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero
+; CHECK-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; CHECK-NEXT: xorps %xmm0, %xmm0
; CHECK-NEXT: xorps %xmm1, %xmm1
-; CHECK-NEXT: subss %xmm3, %xmm1
-; CHECK-NEXT: movaps %xmm3, %xmm4
-; CHECK-NEXT: mulss %xmm0, %xmm4
-; CHECK-NEXT: addss %xmm3, %xmm4
-; CHECK-NEXT: movaps %xmm3, %xmm2
-; CHECK-NEXT: subss %xmm4, %xmm2
-; CHECK-NEXT: addss %xmm3, %xmm4
-; CHECK-NEXT: xorps %xmm5, %xmm5
-; CHECK-NEXT: subss %xmm1, %xmm5
+; CHECK-NEXT: subss %xmm2, %xmm1
+; CHECK-NEXT: movaps %xmm2, %xmm3
+; CHECK-NEXT: mulss %xmm0, %xmm3
+; CHECK-NEXT: addss %xmm2, %xmm3
+; CHECK-NEXT: movaps %xmm2, %xmm4
+; CHECK-NEXT: subss %xmm3, %xmm4
; CHECK-NEXT: addss %xmm0, %xmm1
-; CHECK-NEXT: mulss %xmm0, %xmm4
-; CHECK-NEXT: mulss %xmm0, %xmm5
-; CHECK-NEXT: addss %xmm4, %xmm5
-; CHECK-NEXT: addss %xmm0, %xmm5
-; CHECK-NEXT: movss %xmm5, 0
-; CHECK-NEXT: movss %xmm3, (%ecx)
-; CHECK-NEXT: addss %xmm0, %xmm3
-; CHECK-NEXT: movss %xmm3, 0
-; CHECK-NEXT: mulss %xmm0, %xmm1
-; CHECK-NEXT: mulss %xmm0, %xmm2
-; CHECK-NEXT: addss %xmm1, %xmm2
; CHECK-NEXT: addss %xmm0, %xmm2
-; CHECK-NEXT: movss %xmm2, (%ecx)
+; CHECK-NEXT: movss %xmm2, 0
+; CHECK-NEXT: mulss %xmm0, %xmm1
+; CHECK-NEXT: mulss %xmm0, %xmm4
+; CHECK-NEXT: addss %xmm1, %xmm4
+; CHECK-NEXT: addss %xmm0, %xmm4
+; CHECK-NEXT: movss %xmm4, (%ecx)
; CHECK-NEXT: retl
entry:
br i1 true, label %bb171.preheader, label %bb431
diff --git a/llvm/test/CodeGen/X86/2008-03-31-SpillerFoldingBug.ll b/llvm/test/CodeGen/X86/2008-03-31-SpillerFoldingBug.ll
index 180d6719837b26..c4afa2ae393ca6 100644
--- a/llvm/test/CodeGen/X86/2008-03-31-SpillerFoldingBug.ll
+++ b/llvm/test/CodeGen/X86/2008-03-31-SpillerFoldingBug.ll
@@ -34,7 +34,6 @@ define void @_GLOBAL__I__ZN5Pooma5pinfoE() nounwind {
; CHECK-NEXT: movl %eax, %esi
; CHECK-NEXT: movl $0, (%esp)
; CHECK-NEXT: calll __ZNSt8ios_baseC2Ev
-; CHECK-NEXT: movl $0, 0
; CHECK-NEXT: addl $12, %ebx
; CHECK-NEXT: movl %ebx, (%esi)
; CHECK-NEXT: movl L__ZTVSt15basic_streambufIcSt11char_traitsIcEE$non_lazy_ptr-L0$pb(%edi), %eax
diff --git a/llvm/test/CodeGen/X86/MergeConsecutiveStores.ll b/llvm/test/CodeGen/X86/MergeConsecutiveStores.ll
index 0103d2bf3cc2c6..d2fe4897c18450 100644
--- a/llvm/test/CodeGen/X86/MergeConsecutiveStores.ll
+++ b/llvm/test/CodeGen/X86/MergeConsecutiveStores.ll
@@ -402,9 +402,9 @@ define void @merge_loads_i16(i32 %count, ptr noalias nocapture %q, ptr noalias n
define void @no_merge_loads(i32 %count, ptr noalias nocapture %q, ptr noalias nocapture %p) nounwind uwtable noinline ssp {
; X86-BWON-LABEL: no_merge_loads:
; X86-BWON: # %bb.0:
-; X86-BWON-NEXT: pushl %ebx
+; X86-BWON-NEXT: pushl %esi
; X86-BWON-NEXT: .cfi_def_cfa_offset 8
-; X86-BWON-NEXT: .cfi_offset %ebx, -8
+; X86-BWON-NEXT: .cfi_offset %esi, -8
; X86-BWON-NEXT: movl {{[0-9]+}}(%esp), %eax
; X86-BWON-NEXT: testl %eax, %eax
; X86-BWON-NEXT: jle .LBB5_3
@@ -414,23 +414,21 @@ define void @no_merge_loads(i32 %count, ptr noalias nocapture %q, ptr noalias no
; X86-BWON-NEXT: .p2align 4
; X86-BWON-NEXT: .LBB5_2: # %a4
; X86-BWON-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-BWON-NEXT: movzbl (%edx), %ebx
-; X86-BWON-NEXT: movb %bl, (%ecx)
-; X86-BWON-NEXT: movzbl 1(%edx), %ebx
-; X86-BWON-NEXT: movb %bl, 1(%ecx)
+; X86-BWON-NEXT: movzwl (%edx), %esi
+; X86-BWON-NEXT: movw %si, (%ecx)
; X86-BWON-NEXT: addl $8, %ecx
; X86-BWON-NEXT: decl %eax
; X86-BWON-NEXT: jne .LBB5_2
; X86-BWON-NEXT: .LBB5_3: # %._crit_edge
-; X86-BWON-NEXT: popl %ebx
+; X86-BWON-NEXT: popl %esi
; X86-BWON-NEXT: .cfi_def_cfa_offset 4
; X86-BWON-NEXT: retl
;
; X86-BWOFF-LABEL: no_merge_loads:
; X86-BWOFF: # %bb.0:
-; X86-BWOFF-NEXT: pushl %ebx
+; X86-BWOFF-NEXT: pushl %esi
; X86-BWOFF-NEXT: .cfi_def_cfa_offset 8
-; X86-BWOFF-NEXT: .cfi_offset %ebx, -8
+; X86-BWOFF-NEXT: .cfi_offset %esi, -8
; X86-BWOFF-NEXT: movl {{[0-9]+}}(%esp), %eax
; X86-BWOFF-NEXT: testl %eax, %eax
; X86-BWOFF-NEXT: jle .LBB5_3
@@ -440,15 +438,13 @@ define void @no_merge_loads(i32 %count, ptr noalias nocapture %q, ptr noalias no
; X86-BWOFF-NEXT: .p2align 4
; X86-BWOFF-NEXT: .LBB5_2: # %a4
; X86-BWOFF-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-BWOFF-NEXT: movb (%edx), %bl
-; X86-BWOFF-NEXT: movb %bl, (%ecx)
-; X86-BWOFF-NEXT: movb 1(%edx), %bl
-; X86-BWOFF-NEXT: movb %bl, 1(%ecx)
+; X86-BWOFF-NEXT: movw (%edx), %si
+; X86-BWOFF-NEXT: movw %si, (%ecx)
; X86-BWOFF-NEXT: addl $8, %ecx
; X86-BWOFF-NEXT: decl %eax
; X86-BWOFF-NEXT: jne .LBB5_2
; X86-BWOFF-NEXT: .LBB5_3: # %._crit_edge
-; X86-BWOFF-NEXT: popl %ebx
+; X86-BWOFF-NEXT: popl %esi
; X86-BWOFF-NEXT: .cfi_def_cfa_offset 4
; X86-BWOFF-NEXT: retl
;
@@ -459,10 +455,8 @@ define void @no_merge_loads(i32 %count, ptr noalias nocapture %q, ptr noalias no
; X64-BWON-NEXT: .p2align 4
; X64-BWON-NEXT: .LBB5_1: # %a4
; X64-BWON-NEXT: # =>This Inner Loop Header: Depth=1
-; X64-BWON-NEXT: movzbl (%rsi), %eax
-; X64-BWON-NEXT: movb %al, (%rdx)
-; X64-BWON-NEXT: movzbl 1(%rsi), %eax
-; X64-BWON-NEXT: movb %al, 1(%rdx)
+; X64-BWON-NEXT: movzwl (%rsi), %eax
+; X64-BWON-NEXT: movw %ax, (%rdx)
; X64-BWON-NEXT: addq $8, %rdx
; X64-BWON-NEXT: decl %edi
; X64-BWON-NEXT: jne .LBB5_1
@@ -476,10 +470,8 @@ define void @no_merge_loads(i32 %count, ptr noalias nocapture %q, ptr noalias no
; X64-BWOFF-NEXT: .p2align 4
; X64-BWOFF-NEXT: .LBB5_1: # %a4
; X64-BWOFF-NEXT: # =>This Inner Loop Header: Depth=1
-; X64-BWOFF-NEXT: movb (%rsi), %al
-; X64-BWOFF-NEXT: movb %al, (%rdx)
-; X64-BWOFF-NEXT: movb 1(%rsi), %al
-; X64-BWOFF-NEXT: movb %al, 1(%rdx)
+; X64-BWOFF-NEXT: movw (%rsi), %ax
+; X64-BWOFF-NEXT: movw %ax, (%rdx)
; X64-BWOFF-NEXT: addq $8, %rdx
; X64-BWOFF-NEXT: decl %edi
; X64-BWOFF-NEXT: jne .LBB5_1
@@ -858,26 +850,26 @@ define void @MergeLoadStoreBaseIndexOffsetComplicated(ptr %a, ptr %b, ptr %c, i6
; X86-BWON-NEXT: .cfi_offset %edi, -16
; X86-BWON-NEXT: .cfi_offset %ebx, -12
; X86-BWON-NEXT: .cfi_offset %ebp, -8
-; X86-BWON-NEXT: xorl %eax, %eax
-; X86-BWON-NEXT: movl {{[0-9]+}}(%esp), %esi
+; X86-BWON-NEXT: xorl %esi, %esi
; X86-BWON-NEXT: movl {{[0-9]+}}(%esp), %edi
; X86-BWON-NEXT: movl {{[0-9]+}}(%esp), %ebx
; X86-BWON-NEXT: xorl %ebp, %ebp
; X86-BWON-NEXT: .p2align 4
; X86-BWON-NEXT: .LBB10_1: # =>This Inner Loop Header: Depth=1
; X86-BWON-NEXT: movsbl (%edi), %ecx
-; X86-BWON-NEXT: movzbl (%esi,%ecx), %edx
-; X86-BWON-NEXT: movzbl 1(%esi,%ecx), %ecx
-; X86-BWON-NEXT: movb %dl, (%ebx,%eax)
-; X86-BWON-NEXT: movl %eax, %edx
-; X86-BWON-NEXT: orl $1, %edx
-; X86-BWON-NEXT: movb %cl, (%ebx,%edx)
+; X86-BWON-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-BWON-NEXT: movzbl (%eax,%ecx), %edx
+; X86-BWON-NEXT: movzbl 1(%eax,%ecx), %ecx
+; X86-BWON-NEXT: movl %esi, %eax
+; X86-BWON-NEXT: orl $1, %eax
+; X86-BWON-NEXT: movb %cl, (%ebx,%eax)
+; X86-BWON-NEXT: movb %dl, (%ebx,%esi)
; X86-BWON-NEXT: incl %edi
-; X86-BWON-NEXT: addl $2, %eax
+; X86-BWON-NEXT: addl $2, %esi
; X86-BWON-NEXT: adcl $0, %ebp
-; X86-BWON-NEXT: cmpl {{[0-9]+}}(%esp), %eax
-; X86-BWON-NEXT: movl %ebp, %ecx
-; X86-BWON-NEXT: sbbl {{[0-9]+}}(%esp), %ecx
+; X86-BWON-NEXT: cmpl {{[0-9]+}}(%esp), %esi
+; X86-BWON-NEXT: movl %ebp, %eax
+; X86-BWON-NEXT: sbbl {{[0-9]+}}(%esp), %eax
; X86-BWON-NEXT: jl .LBB10_1
; X86-BWON-NEXT: # %bb.2:
; X86-BWON-NEXT: popl %esi
@@ -904,26 +896,26 @@ define void @MergeLoadStoreBaseIndexOffsetComplicated(ptr %a, ptr %b, ptr %c, i6
; X86-BWOFF-NEXT: .cfi_offset %edi, -16
; X86-BWOFF-NEXT: .cfi_offset %ebx, -12
; X86-BWOFF-NEXT: .cfi_offset %ebp, -8
-; X86-BWOFF-NEXT: xorl %eax, %eax
-; X86-BWOFF-NEXT: movl {{[0-9]+}}(%esp), %esi
+; X86-BWOFF-NEXT: xorl %esi, %esi
; X86-BWOFF-NEXT: movl {{[0-9]+}}(%esp), %edi
; X86-BWOFF-NEXT: movl {{[0-9]+}}(%esp), %ebx
; X86-BWOFF-NEXT: xorl %ebp, %ebp
; X86-BWOFF-NEXT: .p2align 4
; X86-BWOFF-NEXT: .LBB10_1: # =>This Inner Loop Header: Depth=1
; X86-BWOFF-NEXT: movsbl (%edi), %ecx
-; X86-BWOFF-NEXT: movb (%esi,%ecx), %dl
-; X86-BWOFF-NEXT: movb 1(%esi,%ecx), %cl
-; X86-BWOFF-NEXT: movb %dl, (%ebx,%eax)
-; X86-BWOFF-NEXT: movl %eax, %edx
-; X86-BWOFF-NEXT: orl $1, %edx
-; X86-BWOFF-NEXT: movb %cl, (%ebx,%edx)
+; X86-BWOFF-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-BWOFF-NEXT: movb (%eax,%ecx), %dl
+; X86-BWOFF-NEXT: movb 1(%eax,%ecx), %cl
+; X86-BWOFF-NEXT: movl %esi, %eax
+; X86-BWOFF-NEXT: orl $1, %eax
+; X86-BWOFF-NEXT: movb %cl, (%ebx,%eax)
+; X86-BWOFF-NEXT: movb %dl, (%ebx,%esi)
; X86-BWOFF-NEXT: incl %edi
-; X86-BWOFF-NEXT: addl $2, %eax
+; X86-BWOFF-NEXT: addl $2, %esi
; X86-BWOFF-NEXT: adcl $0, %ebp
-; X86-BWOFF-NEXT: cmpl {{[0-9]+}}(%esp), %eax
-; X86-BWOFF-NEXT: movl %ebp, %ecx
-; X86-BWOFF-NEXT: sbbl {{[0-9]+}}(%esp), %ecx
+; X86-BWOFF-NEXT: cmpl {{[0-9]+}}(%esp), %esi
+; X86-BWOFF-NEXT: movl %ebp, %eax
+; X86-BWOFF-NEXT: sbbl {{[0-9]+}}(%esp), %eax
; X86-BWOFF-NEXT: jl .LBB10_1
; X86-BWOFF-NEXT: # %bb.2:
; X86-BWOFF-NEXT: popl %esi
diff --git a/llvm/test/CodeGen/X86/addcarry.ll b/llvm/test/CodeGen/X86/addcarry.ll
index f8d32fc2d29252..ce1bf72d70a738 100644
--- a/llvm/test/CodeGen/X86/addcarry.ll
+++ b/llvm/test/CodeGen/X86/addcarry.ll
@@ -1155,14 +1155,14 @@ define void @PR39464(ptr noalias nocapture sret(%struct.U192) %0, ptr nocapture
; CHECK: # %bb.0:
; CHECK-NEXT: movq %rdi, %rax
; CHECK-NEXT: movq (%rsi), %rcx
+; CHECK-NEXT: movq 8(%rsi), %rdi
; CHECK-NEXT: addq (%rdx), %rcx
-; CHECK-NEXT: movq %rcx, (%rdi)
-; CHECK-NEXT: movq 8(%rsi), %rcx
-; CHECK-NEXT: adcq 8(%rdx), %rcx
-; CHECK-NEXT: movq %rcx, 8(%rdi)
+; CHECK-NEXT: movq %rcx, (%rax)
+; CHECK-NEXT: adcq 8(%rdx), %rdi
+; CHECK-NEXT: movq %rdi, 8(%rax)
; CHECK-NEXT: movq 16(%rsi), %rcx
; CHECK-NEXT: adcq 16(%rdx), %rcx
-; CHECK-NEXT: movq %rcx, 16(%rdi)
+; CHECK-NEXT: movq %rcx, 16(%rax)
; CHECK-NEXT: retq
%4 = load i64, ptr %1, align 8
%5 = load i64, ptr %2, align 8
diff --git a/llvm/test/CodeGen/X86/avoid-sfb.ll b/llvm/test/CodeGen/X86/avoid-sfb.ll
index 22b4fddf88e457..29de9b6e68b22b 100644
--- a/llvm/test/CodeGen/X86/avoid-sfb.ll
+++ b/llvm/test/CodeGen/X86/avoid-sfb.ll
@@ -418,18 +418,18 @@ define void @test_multiple_blocks(ptr nocapture noalias %s1, ptr nocapture %s2)
; CHECK-NEXT: movl $0, 36(%rdi)
; CHECK-NEXT: movups 16(%rdi), %xmm0
; CHECK-NEXT: movups %xmm0, 16(%rsi)
-; CHECK-NEXT: movl 32(%rdi), %eax
-; CHECK-NEXT: movl %eax, 32(%rsi)
-; CHECK-NEXT: movl 36(%rdi), %eax
-; CHECK-NEXT: movl %eax, 36(%rsi)
-; CHECK-NEXT: movq 40(%rdi), %rax
-; CHECK-NEXT: movq %rax, 40(%rsi)
; CHECK-NEXT: movl (%rdi), %eax
; CHECK-NEXT: movl %eax, (%rsi)
; CHECK-NEXT: movl 4(%rdi), %eax
; CHECK-NEXT: movl %eax, 4(%rsi)
; CHECK-NEXT: movq 8(%rdi), %rax
; CHECK-NEXT: movq %rax, 8(%rsi)
+; CHECK-NEXT: movl 32(%rdi), %eax
+; CHECK-NEXT: movl %eax, 32(%rsi)
+; CHECK-NEXT: movl 36(%rdi), %eax
+; CHECK-NEXT: movl %eax, 36(%rsi)
+; CHECK-NEXT: movq 40(%rdi), %rax
+; CHECK-NEXT: movq %rax, 40(%rsi)
; CHECK-NEXT: retq
;
; DISABLED-LABEL: test_multiple_blocks:
@@ -438,33 +438,11 @@ define void @test_multiple_blocks(ptr nocapture noalias %s1, ptr nocapture %s2)
; DISABLED-NEXT: movl $0, 36(%rdi)
; DISABLED-NEXT: movups 16(%rdi), %xmm0
; DISABLED-NEXT: movups %xmm0, 16(%rsi)
-; DISABLED-NEXT: movups 32(%rdi), %xmm0
-; DISABLED-NEXT: movups %xmm0, 32(%rsi)
; DISABLED-NEXT: movups (%rdi), %xmm0
; DISABLED-NEXT: movups %xmm0, (%rsi)
+; DISABLED-NEXT: movups 32(%rdi), %xmm0
+; DISABLED-NEXT: movups %xmm0, 32(%rsi)
; DISABLED-NEXT: retq
-;
-; AVX-LABEL: test_multiple_blocks:
-; AVX: # %bb.0: # %entry
-; AVX-NEXT: movl $0, 4(%rdi)
-; AVX-NEXT: movl $0, 36(%rdi)
-; AVX-NEXT: vmovups 16(%rdi), %xmm0
-; AVX-NEXT: vmovups %xmm0, 16(%rsi)
-; AVX-NEXT: movl 32(%rdi), %eax
-; AVX-NEXT: movl %eax, 32(%rsi)
-; AVX-NEXT: movl 36(%rdi), %eax
-; AVX-NEXT: movl %eax, 36(%rsi)
-; AVX-NEXT: movq 40(%rdi), %rax
-; AVX-NEXT: movq %rax, 40(%rsi)
-; AVX-NEXT: movl (%rdi), %eax
-; AVX-NEXT: movl %eax, (%rsi)
-; AVX-NEXT: movl 4(%rdi), %eax
-; AVX-NEXT: movl %eax, 4(%rsi)
-; AVX-NEXT: vmovups 8(%rdi), %xmm0
-; AVX-NEXT: vmovups %xmm0, 8(%rsi)
-; AVX-NEXT: movq 24(%rdi), %rax
-; AVX-NEXT: movq %rax, 24(%rsi)
-; AVX-NEXT: retq
entry:
%b = getelementptr inbounds %struct.S4, ptr %s1, i64 0, i32 1
store i32 0, ptr %b, align 4
@@ -547,62 +525,26 @@ if.end: ; preds = %if.then, %entry
; Function Attrs: nounwind uwtable
define void @test_stack(ptr noalias nocapture sret(%struct.S6) %agg.result, ptr byval(%struct.S6) nocapture readnone align 8 %s1, ptr byval(%struct.S6) nocapture align 8 %s2, i32 %x) local_unnamed_addr #0 {
-; CHECK-LABEL: test_stack:
-; CHECK: # %bb.0: # %entry
-; CHECK-NEXT: movq %rdi, %rax
-; CHECK-NEXT: movl %esi, {{[0-9]+}}(%rsp)
-; CHECK-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
-; CHECK-NEXT: movups %xmm0, (%rdi)
-; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rcx
-; CHECK-NEXT: movq %rcx, 16(%rdi)
-; CHECK-NEXT: movl {{[0-9]+}}(%rsp), %ecx
-; CHECK-NEXT: movl %ecx, 24(%rdi)
-; CHECK-NEXT: movl {{[0-9]+}}(%rsp), %ecx
-; CHECK-NEXT: movl %ecx, 28(%rdi)
-; CHECK-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
-; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rcx
-; CHECK-NEXT: movl {{[0-9]+}}(%rsp), %edx
-; CHECK-NEXT: movl {{[0-9]+}}(%rsp), %esi
-; CHECK-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
-; CHECK-NEXT: movq %rcx, {{[0-9]+}}(%rsp)
-; CHECK-NEXT: movl %edx, {{[0-9]+}}(%rsp)
-; CHECK-NEXT: movl %esi, {{[0-9]+}}(%rsp)
-; CHECK-NEXT: retq
-;
-; DISABLED-LABEL: test_stack:
-; DISABLED: # %bb.0: # %entry
-; DISABLED-NEXT: movq %rdi, %rax
-; DISABLED-NEXT: movl %esi, {{[0-9]+}}(%rsp)
-; DISABLED-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
-; DISABLED-NEXT: movups %xmm0, (%rdi)
-; DISABLED-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
-; DISABLED-NEXT: movups %xmm0, 16(%rdi)
-; DISABLED-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
-; DISABLED-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
-; DISABLED-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
-; DISABLED-NEXT: movaps %xmm1, {{[0-9]+}}(%rsp)
-; DISABLED-NEXT: retq
+; SSE-LABEL: test_stack:
+; SSE: # %bb.0: # %entry
+; SSE-NEXT: movq %rdi, %rax
+; SSE-NEXT: movl %esi, {{[0-9]+}}(%rsp)
+; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
+; SSE-NEXT: movups %xmm0, (%rdi)
+; SSE-NEXT: movaps {{[0-9]+}}(%rsp), %xmm1
+; SSE-NEXT: movups %xmm1, 16(%rdi)
+; SSE-NEXT: movaps %xmm0, {{[0-9]+}}(%rsp)
+; SSE-NEXT: movaps %xmm1, {{[0-9]+}}(%rsp)
+; SSE-NEXT: retq
;
; AVX-LABEL: test_stack:
; AVX: # %bb.0: # %entry
; AVX-NEXT: movq %rdi, %rax
; AVX-NEXT: movl %esi, {{[0-9]+}}(%rsp)
-; AVX-NEXT: vmovups {{[0-9]+}}(%rsp), %xmm0
-; AVX-NEXT: vmovups %xmm0, (%rdi)
-; AVX-NEXT: movq {{[0-9]+}}(%rsp), %rcx
-; AVX-NEXT: movq %rcx, 16(%rdi)
-; AVX-NEXT: movl {{[0-9]+}}(%rsp), %ecx
-; AVX-NEXT: movl %ecx, 24(%rdi)
-; AVX-NEXT: movl {{[0-9]+}}(%rsp), %ecx
-; AVX-NEXT: movl %ecx, 28(%rdi)
-; AVX-NEXT: vmovups {{[0-9]+}}(%rsp), %xmm0
-; AVX-NEXT: vmovups %xmm0, {{[0-9]+}}(%rsp)
-; AVX-NEXT: movq {{[0-9]+}}(%rsp), %rcx
-; AVX-NEXT: movq %rcx, {{[0-9]+}}(%rsp)
-; AVX-NEXT: movl {{[0-9]+}}(%rsp), %ecx
-; AVX-NEXT: movl %ecx, {{[0-9]+}}(%rsp)
-; AVX-NEXT: movl {{[0-9]+}}(%rsp), %ecx
-; AVX-NEXT: movl %ecx, {{[0-9]+}}(%rsp)
+; AVX-NEXT: vmovups {{[0-9]+}}(%rsp), %ymm0
+; AVX-NEXT: vmovups %ymm0, (%rdi)
+; AVX-NEXT: vmovups %ymm0, {{[0-9]+}}(%rsp)
+; AVX-NEXT: vzeroupper
; AVX-NEXT: retq
entry:
%s6.sroa.3.0..sroa_idx4 = getelementptr inbounds %struct.S6, ptr %s2, i64 0, i32 3
diff --git a/llvm/test/CodeGen/X86/cet_endbr_imm_enhance.ll b/llvm/test/CodeGen/X86/cet_endbr_imm_enhance.ll
index 98d315ad14e684..ea8c1b63869834 100644
--- a/llvm/test/CodeGen/X86/cet_endbr_imm_enhance.ll
+++ b/llvm/test/CodeGen/X86/cet_endbr_imm_enhance.ll
@@ -29,9 +29,8 @@ define dso_local i64 @foo(ptr %azx) #0 {
; CHECK-NEXT: movq %rdi, -{{[0-9]+}}(%rsp)
; CHECK-NEXT: movabsq $-321002333478651, %rax # imm = 0xFFFEDC0CD1F0E105
; CHECK-NEXT: notq %rax
-; CHECK-NEXT: andq %rax, (%rdi)
-; CHECK-NEXT: movq -{{[0-9]+}}(%rsp), %rax
-; CHECK-NEXT: movq (%rax), %rax
+; CHECK-NEXT: andq (%rdi), %rax
+; CHECK-NEXT: movq %rax, (%rdi)
; CHECK-NEXT: retq
entry:
%azx.addr = alloca ptr, align 8
diff --git a/llvm/test/CodeGen/X86/cfguard-x86-vectorcall.ll b/llvm/test/CodeGen/X86/cfguard-x86-vectorcall.ll
index a75973310d15cc..8e63e19cc3d1cf 100644
--- a/llvm/test/CodeGen/X86/cfguard-x86-vectorcall.ll
+++ b/llvm/test/CodeGen/X86/cfguard-x86-vectorcall.ll
@@ -15,12 +15,12 @@ entry:
; X86-LABEL: func_cf_vector_x86
; X86: movl 12(%ebp), %eax
; X86: movl 8(%ebp), %ecx
- ; X86: movsd 24(%eax), %xmm4 # xmm4 = mem[0],zero
+ ; X86: movsd 24(%eax), %xmm4 # xmm4 = mem[0],zero
+ ; X86: movsd 16(%eax), %xmm5 # xmm5 = mem[0],zero
+ ; X86: movsd (%eax), %xmm6 # xmm6 = mem[0],zero
+ ; X86: movsd 8(%eax), %xmm7 # xmm7 = mem[0],zero
; X86: movsd %xmm4, 24(%esp)
- ; X86: movsd 16(%eax), %xmm5 # xmm5 = mem[0],zero
; X86: movsd %xmm5, 16(%esp)
- ; X86: movsd (%eax), %xmm6 # xmm6 = mem[0],zero
- ; X86: movsd 8(%eax), %xmm7 # xmm7 = mem[0],zero
; X86: movsd %xmm7, 8(%esp)
; X86: movsd %xmm6, (%esp)
; X86: calll *___guard_check_icall_fptr
@@ -29,6 +29,7 @@ entry:
; X86: movaps %xmm5, %xmm2
; X86: movaps %xmm4, %xmm3
; X86: calll *%ecx
+
}
attributes #0 = { "target-cpu"="pentium4" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" }
diff --git a/llvm/tes...
[truncated]
-; BWON: movzbl
-; BWOFF: movb
+; BWON: movzwl
+; BWOFF: movw
Can you confirm that this is all the memory traffic? I'd expect the load-store-load stages to still be present.
Yes, it appears the two `movzbl` and `movb` have been merged into one `movzwl` and one `movw`:
Before:
LBB0_1: ## %a4
## =>This Inner Loop Header: Depth=1
movzbl (%rsi), %eax
movb %al, (%rdx)
movzbl 1(%rsi), %eax
movb %al, 1(%rdx)
addq $8, %rdx
decl %edi
jne LBB0_1
After:
LBB0_1: ## %a4
## =>This Inner Loop Header: Depth=1
movzwl (%rsi), %eax
movw %ax, (%rdx)
addq $8, %rdx
decl %edi
jne LBB0_1
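
In IR terms, a minimal sketch of the pattern involved (a hypothetical reduction, not the actual test): because both pointers carry `noalias`, codegen AA can prove the intervening store does not clobber the second load, which is what makes the merge legal.

```llvm
define void @copy_pair(ptr noalias %src, ptr noalias %dst) {
  ; Without AA, the store to %dst must be assumed to clobber %src + 1,
  ; forcing two separate byte copies. With AA, the loads/stores merge
  ; into a single 16-bit copy (movzwl + movw).
  %b0 = load i8, ptr %src
  store i8 %b0, ptr %dst
  %s1 = getelementptr inbounds i8, ptr %src, i64 1
  %b1 = load i8, ptr %s1
  %d1 = getelementptr inbounds i8, ptr %dst, i64 1
  store i8 %b1, ptr %d1
  ret void
}
```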
OK - I'm going to clean up the test file and regenerate the checks so it's easier to update in the future.
It looks like enabling use of AA in codegen is very expensive: http://llvm-compile-time-tracker.com/compare.php?from=2d317d903a6c469d4bf64298b21b6dac83f1fc8b&to=b3ab8b78fc3f3bed0347a69bf035741a03862c94&stat=instructions:u

image.c from lencod may be a good test case, as it increases by 120%.

One obvious improvement would be to make use of BatchAA: as we're not modifying IR at this point, using caching should be safe. But we probably also have some places that perform an unreasonable number of AA queries, relying on them being very cheap right now.
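
As a rough sketch of the caching idea (illustrative only — the change that actually landed is #123934 below): `BatchAAResults` wraps an `AAResults` and memoizes query results, which is sound as long as the IR is not mutated between queries.

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
using namespace llvm;

// Illustrative helper (not from the patch): run many alias queries against
// unchanging IR through one caching wrapper instead of raw AAResults.
static bool storeMayClobberAnyLoad(AAResults &AA,
                                   const MemoryLocation &StoreLoc,
                                   ArrayRef<MemoryLocation> LoadLocs) {
  BatchAAResults BatchAA(AA); // memoizes repeated pairwise queries
  for (const MemoryLocation &LoadLoc : LoadLocs)
    if (!BatchAA.isNoAlias(StoreLoc, LoadLoc))
      return true; // may (or must) alias: treat as a clobber
  return false;
}
```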
I've pushed a quick test commit that changes the DAG to use BatchAAResults, and that didn't change any tests for any targets I build (AArch64, RISC-V, x86). I'd be curious to know how (if at all) that affects your benchmarks (if it's not too much trouble to run them again).
@MacDue It's a nice improvement: https://llvm-compile-time-tracker.com/compare.php?from=b3ab8b78fc3f3bed0347a69bf035741a03862c94&to=033a963fb3f69491c8ada5422593b93d0b87f438&stat=instructions:u Can you please submit it as a separate PR?
@@ -249,7 +249,8 @@ namespace {
public:
  DAGCombiner(SelectionDAG &D, AliasAnalysis *AA, CodeGenOptLevel OL)
It would be better to directly accept a BatchAA argument here. That way you can reuse the same BatchAA instance across multiple DAGCombine runs (we do 2-4 per function).
Sure, will do 👍
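
One possible shape of that suggestion (hypothetical; the signature that actually landed may differ):

```cpp
// Hypothetical sketch: accept the caching wrapper directly, so one
// BatchAAResults instance can be shared across the 2-4 DAGCombine runs
// that happen per function.
DAGCombiner(SelectionDAG &D, BatchAAResults *BatchAA, CodeGenOptLevel OL);
```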
Once we get to SelectionDAG the IR should not be changing anymore, so we can use BatchAAResults rather than AAResults to cache AA queries. This should be an NFC change for targets that enable AA during codegen (such as AArch64), but also give a nice compile-time improvement in some cases. See: llvm#123787 (comment). Note: This follows Nikita's suggestion on llvm#123787.
Created a PR for the BatchAAResults changes here: #123934
Force-pushed from 7373924 to 1888e47.
Rebased this on #123934 (which should reuse the AA query cache a little more, since it's reused across DAG combines now). It likely still costs a few % (based on the previous change https://llvm-compile-time-tracker.com/compare.php?from=2d317d903a6c469d4bf64298b21b6dac83f1fc8b&to=033a963fb3f69491c8ada5422593b93d0b87f438&stat=instructions%3Au).
This can still be disabled by setting the flag `-x86-use-aa=false`. All tests have been updated to account for this change except:

test/CodeGen/X86/regalloc-advanced-split-cost.ll

where the spill needed for part of the test disappears with codegen AA enabled (so it is left disabled for that test).

Enabling AA during codegen makes X86 consistent with other targets such as AArch64 and RISC-V. This will avoid regressing x86 targets when using the new `llvm.sincos` intrinsic; see: #121763.
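
For completeness, a hedged example of exercising the new flag from the command line (`input.ll` is a placeholder file; the flag itself comes from the `X86Subtarget.cpp` hunk above):

```sh
# New default: alias analysis is used during X86 codegen.
llc -mtriple=x86_64-- input.ll -o out.s

# Opt out, restoring the previous conservative behavior.
llc -mtriple=x86_64-- -x86-use-aa=false input.ll -o out.s
```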