[LoongArch] Impl TTI hooks for LoongArch to support LoopDataPrefetch pass #118437

Merged: 2 commits into llvm:main from enable-loopdataprefetch-pass on Jan 20, 2025

Conversation

zhaoqi5
Contributor

@zhaoqi5 zhaoqi5 commented Dec 3, 2024

Inspired by https://reviews.llvm.org/D146600, this commit adds
some TTI hooks for LoongArch so that the LoopDataPrefetch pass
actually takes effect. The hooks are:

  • getCacheLineSize(): returns 64 for loongarch64.
  • getPrefetchDistance(): returns 200. In SPEC CPU 2017 testing, the
    improvement from prefetching was most visible with PrefetchDistance
    set to 200 (results shown below), although the best value differs
    from benchmark to benchmark.
  • enableWritePrefetching(): returns true by default, because LoongArch
    supports store prefetch.
  • getMinPrefetchStride() and getMaxPrefetchIterationsAhead(): not
    overridden; they keep their default values, 1 and UINT_MAX.
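
The override pattern described in the list above can be sketched in isolation. This is a minimal standalone model, not LLVM's real class hierarchy: `GenericPrefetchInfo` and `LoongArchPrefetchInfo` are hypothetical names, and the generic defaults of 0 and false for the first three hooks are assumptions. Only the values 64, 200, true, 1 and UINT_MAX come from the description.

```cpp
#include <climits>

// Hypothetical stand-in for the target-independent defaults.
// The 0/false defaults below are assumptions made for this sketch.
struct GenericPrefetchInfo {
  virtual ~GenericPrefetchInfo() = default;
  virtual unsigned getCacheLineSize() const { return 0; }
  virtual unsigned getPrefetchDistance() const { return 0; }
  virtual bool enableWritePrefetching() const { return false; }
  // Defaults named in the PR description: 1 and UINT_MAX.
  virtual unsigned getMinPrefetchStride() const { return 1; }
  virtual unsigned getMaxPrefetchIterationsAhead() const { return UINT_MAX; }
};

// Hypothetical model of the LoongArch overrides added by this PR.
struct LoongArchPrefetchInfo : GenericPrefetchInfo {
  unsigned getCacheLineSize() const override { return 64; }
  unsigned getPrefetchDistance() const override { return 200; }
  bool enableWritePrefetching() const override { return true; }
  // getMinPrefetchStride() and getMaxPrefetchIterationsAhead() are
  // deliberately left alone, so the base-class defaults apply.
};
```

Queried through a `const GenericPrefetchInfo &`, a `LoongArchPrefetchInfo` object reports the tuned values for the three overridden hooks and the inherited defaults for the other two.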

After this commit, the test added by https://reviews.llvm.org/D146600 can
generate llvm.prefetch intrinsic IR correctly.
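
To make the effect of the distance concrete, here is a simplified model of the arithmetic. It assumes the pass obtains the number of iterations to look ahead by dividing the prefetch distance by the loop's instruction count, with a minimum of one iteration. The loop size of 8 is an assumption chosen so that the result matches the 200-byte offset visible in the updated test's CHECK lines; the function names are hypothetical.

```cpp
#include <cstdint>

// Simplified model (assumption): iterations ahead = distance / loop size,
// never less than one iteration.
constexpr unsigned itersAhead(unsigned prefetchDistance, unsigned loopSize) {
  if (loopSize == 0)
    loopSize = 1; // guard against an empty instruction count
  unsigned iters = prefetchDistance / loopSize;
  return iters == 0 ? 1 : iters;
}

// Byte offset of the prefetched address relative to the current access.
constexpr std::uint64_t prefetchByteOffset(unsigned iters,
                                           std::uint64_t strideBytes) {
  return static_cast<std::uint64_t>(iters) * strideBytes;
}

// Distance 200 with an assumed loop size of 8 gives 25 iterations ahead.
static_assert(itersAhead(200, 8) == 25, "25 iterations ahead");
// With 8-byte doubles, 25 iterations ahead is 200 bytes.
static_assert(prefetchByteOffset(25, 8) == 200, "200-byte offset");
```

Under this model a larger loop body looks fewer iterations ahead for the same distance, which is one plausible reason the best distance varies between benchmarks.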

Results of spec2017rate benchmarks (test data: ref, copies: 1):

  • For all C/C++ benchmarks, relative to O3+novec/lsx/lasx, enabling
    prefetch changes performance by about -1.58%/+0.31%/+0.07% on the int
    benchmarks and +3.26%/+3.73%/+3.78% on the floating-point benchmarks.
    (Only O3+novec+prefetch regresses on intrate.)
  • However, prefetch reduces performance for almost every Fortran
    benchmark compiled by flang. Across all C/C++/Fortran benchmarks
    together, performance with prefetch decreases by about 1% to 5%.

FIXME: Keep the loongarch-enable-loop-data-prefetch option defaulting to
false for now because of the negative effect on Fortran.

@llvmbot
Member

llvmbot commented Dec 3, 2024

@llvm/pr-subscribers-backend-loongarch

@llvm/pr-subscribers-llvm-transforms

Author: ZhaoQi (zhaoqi5)

Changes

Inspired by https://reviews.llvm.org/D146600, this commit adds some TTI hooks for LoongArch so that the LoopDataPrefetch pass actually takes effect. The hooks are:

  • getCacheLineSize(): returns 64 for loongarch64.
  • getPrefetchDistance(): returns 200. In SPEC CPU 2017 testing, the improvement from prefetching was most visible with PrefetchDistance set to 200, although the best value differs from benchmark to benchmark.
  • enableWritePrefetching(): returns true by default, because LoongArch supports store prefetch.
  • getMinPrefetchStride() and getMaxPrefetchIterationsAhead(): not overridden; they keep their default values, 1 and UINT_MAX.

After this commit, the test added by https://reviews.llvm.org/D146600 can generate llvm.prefetch intrinsic IR correctly.

TODO: SPEC CPU 2017 is being retested; the results will be added here.

TODO: Make the loongarch-enable-loop-data-prefetch option default to true.


Full diff: https://github.com/llvm/llvm-project/pull/118437.diff

3 Files Affected:

  • (modified) llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.cpp (+6)
  • (modified) llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.h (+4)
  • (modified) llvm/test/Transforms/LoopDataPrefetch/LoongArch/basic.ll (+27-6)
diff --git a/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.cpp b/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.cpp
index 5fbc7c734168d1..cbc9c3f3beca00 100644
--- a/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.cpp
+++ b/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.cpp
@@ -89,4 +89,10 @@ LoongArchTTIImpl::getPopcntSupport(unsigned TyWidth) {
   return ST->hasExtLSX() ? TTI::PSK_FastHardware : TTI::PSK_Software;
 }
 
+unsigned LoongArchTTIImpl::getCacheLineSize() const { return 64; }
+
+unsigned LoongArchTTIImpl::getPrefetchDistance() const { return 200; }
+
+bool LoongArchTTIImpl::enableWritePrefetching() const { return true; }
+
 // TODO: Implement more hooks to provide TTI machinery for LoongArch.
diff --git a/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.h b/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.h
index f7ce75173be203..b3edf131c584c4 100644
--- a/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.h
+++ b/llvm/lib/Target/LoongArch/LoongArchTargetTransformInfo.h
@@ -47,6 +47,10 @@ class LoongArchTTIImpl : public BasicTTIImplBase<LoongArchTTIImpl> {
   const char *getRegisterClassName(unsigned ClassID) const;
   TTI::PopcntSupportKind getPopcntSupport(unsigned TyWidth);
 
+  unsigned getCacheLineSize() const override;
+  unsigned getPrefetchDistance() const override;
+  bool enableWritePrefetching() const override;
+
   // TODO: Implement more hooks to provide TTI machinery for LoongArch.
 };
 
diff --git a/llvm/test/Transforms/LoopDataPrefetch/LoongArch/basic.ll b/llvm/test/Transforms/LoopDataPrefetch/LoongArch/basic.ll
index 8553171ac68ac9..0313bbd8832876 100644
--- a/llvm/test/Transforms/LoopDataPrefetch/LoongArch/basic.ll
+++ b/llvm/test/Transforms/LoopDataPrefetch/LoongArch/basic.ll
@@ -1,16 +1,38 @@
-;; Tag this 'XFAIL' because we need a few more TTIs and ISels.
-; XFAIL: *
-; RUN: opt --mtriple=loongarch64 -mattr=+d --passes=loop-data-prefetch -loongarch-enable-loop-data-prefetch -S < %s | FileCheck %s
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt --mtriple=loongarch64 -mattr=+d --passes=loop-data-prefetch -S < %s | FileCheck %s
 
 define void @foo(ptr %a, ptr %b) {
+; CHECK-LABEL: define void @foo(
+; CHECK-SAME: ptr [[A:%.*]], ptr [[B:%.*]]) #[[ATTR0:[0-9]+]] {
+; CHECK-NEXT:  [[ENTRY:.*]]:
+; CHECK-NEXT:    br label %[[FOR_BODY:.*]]
+; CHECK:       [[FOR_BODY]]:
+; CHECK-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, %[[ENTRY]] ], [ [[INDVARS_IV_NEXT:%.*]], %[[FOR_BODY]] ]
+; CHECK-NEXT:    [[TMP0:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 3
+; CHECK-NEXT:    [[TMP1:%.*]] = add i64 [[TMP0]], 200
+; CHECK-NEXT:    [[SCEVGEP1:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP2:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 3
+; CHECK-NEXT:    [[TMP3:%.*]] = add i64 [[TMP2]], 200
+; CHECK-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[B]], i64 [[TMP3]]
+; CHECK-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds double, ptr [[B]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    call void @llvm.prefetch.p0(ptr [[SCEVGEP]], i32 0, i32 3, i32 1)
+; CHECK-NEXT:    [[TMP4:%.*]] = load double, ptr [[ARRAYIDX]], align 8
+; CHECK-NEXT:    [[ADD:%.*]] = fadd double [[TMP4]], 1.000000e+00
+; CHECK-NEXT:    [[ARRAYIDX2:%.*]] = getelementptr inbounds double, ptr [[A]], i64 [[INDVARS_IV]]
+; CHECK-NEXT:    call void @llvm.prefetch.p0(ptr [[SCEVGEP1]], i32 1, i32 3, i32 1)
+; CHECK-NEXT:    store double [[ADD]], ptr [[ARRAYIDX2]], align 8
+; CHECK-NEXT:    [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT:    [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 1600
+; CHECK-NEXT:    br i1 [[EXITCOND]], label %[[FOR_END:.*]], label %[[FOR_BODY]]
+; CHECK:       [[FOR_END]]:
+; CHECK-NEXT:    ret void
+;
 entry:
   br label %for.body
 
-; CHECK: for.body:
 for.body:                                         ; preds = %for.body, %entry
   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
   %arrayidx = getelementptr inbounds double, ptr %b, i64 %indvars.iv
-; CHECK: call void @llvm.prefetch
   %0 = load double, ptr %arrayidx, align 8
   %add = fadd double %0, 1.000000e+00
   %arrayidx2 = getelementptr inbounds double, ptr %a, i64 %indvars.iv
@@ -19,7 +41,6 @@ for.body:                                         ; preds = %for.body, %entry
   %exitcond = icmp eq i64 %indvars.iv.next, 1600
   br i1 %exitcond, label %for.end, label %for.body
 
-; CHECK: for.end:
 for.end:                                          ; preds = %for.body
   ret void
 }
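
For readers less used to IR, the transformed loop in the CHECK lines corresponds roughly to the source-level sketch below. It is an illustration, not output of the pass: `fooWithPrefetch` and the constants are hypothetical names, and `__builtin_prefetch` is a GCC/Clang builtin, so the sketch assumes one of those compilers. The read prefetch uses rw=0 and the write prefetch rw=1, both with locality 3, mirroring the `i32 0, i32 3` and `i32 1, i32 3` arguments of `llvm.prefetch` in the test.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTripCount = 1600;        // exit condition in the test
constexpr std::uintptr_t kPrefetchBytes = 200;  // byte offset in the CHECK lines

// Form the look-ahead address with integer arithmetic so that no
// out-of-bounds pointer arithmetic is performed near the end of the array.
inline const void *aheadOf(const double *p) {
  return reinterpret_cast<const void *>(
      reinterpret_cast<std::uintptr_t>(p) + kPrefetchBytes);
}

// a[i] = b[i] + 1.0, with a read prefetch for b and a write prefetch for a.
void fooWithPrefetch(double *a, const double *b) {
  for (std::size_t i = 0; i < kTripCount; ++i) {
    __builtin_prefetch(aheadOf(b + i), /*rw=*/0, /*locality=*/3);
    double sum = b[i] + 1.0;
    __builtin_prefetch(aheadOf(a + i), /*rw=*/1, /*locality=*/3);
    a[i] = sum;
  }
}
```

The prefetches are only hints, so the computed values are the same as in the loop without them; only memory access timing can differ.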

@zhaoqi5 zhaoqi5 marked this pull request as draft December 3, 2024 07:14
@zhaoqi5 zhaoqi5 force-pushed the enable-loopdataprefetch-pass branch from 54a8ca4 to b13547b Compare December 11, 2024 08:47
@zhaoqi5 zhaoqi5 marked this pull request as ready for review December 11, 2024 08:55
@zhaoqi5 zhaoqi5 merged commit ca4886b into llvm:main Jan 20, 2025
10 checks passed
@zhaoqi5 zhaoqi5 deleted the enable-loopdataprefetch-pass branch January 20, 2025 08:20