# Patches I used for issue #1622 (#1623)
Merged
## Conversation
The jobs array is getting initialized with O(compiled CPUs²) complexity. Distros and people with bigger systems will build with pretty high values for NUM_THREADS (128, 256 or more), leading to interesting bubbles in performance.

Baseline (single-threaded) performance gets roughly 13 to 15 multiplications per cycle in the interesting range (threading kicks in at a 65x65 by 65x65 multiply). The hardware is theoretically capable of 32 multiplications per cycle. In the tables below, MPC is multiplications per cycle and the percentage is the change relative to this single-threaded baseline.

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 10703.9 | 10.6 | 0.0% | 17990.6 | 6.3 | 0.0% |
| 64 x 64 | 20778.4 | 12.8 | 0.0% | 40629.2 | 6.5 | 0.0% |
| 65 x 65 | 26869.9 | 10.3 | 0.0% | 52545.7 | 5.3 | 0.0% |
| 80 x 80 | 38104.5 | 13.5 | 0.0% | 72492.7 | 7.1 | 0.0% |
| 96 x 96 | 61626.4 | 14.4 | 0.0% | 113983.8 | 7.8 | 0.0% |
| 112 x 112 | 91803.8 | 15.3 | 0.0% | 180987.3 | 7.8 | 0.0% |
| 128 x 128 | 133161.4 | 15.8 | 0.0% | 258374.3 | 8.1 | 0.0% |

When threading is turned on (`TARGET=SKYLAKEX F_COMPILER=GFORTRAN SHARED=1 DYNAMIC_THREADS=1 USE_OPENMP=0 NUM_THREADS=128`):

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 10725.9 | 10.5 | -0.2% | 18134.9 | 6.2 | -0.8% |
| 64 x 64 | 20500.6 | 12.9 | 1.3% | 40929.1 | 6.5 | -0.7% |
| 65 x 65 | 2040832.1 | 0.1 | -7495.2% | 2097633.6 | 0.1 | -3892.0% |
| 80 x 80 | 2063129.1 | 0.2 | -5314.4% | 2119925.2 | 0.2 | -2824.3% |
| 96 x 96 | 2070374.5 | 0.4 | -3259.6% | 2173604.4 | 0.4 | -1806.9% |
| 112 x 112 | 2111721.5 | 0.7 | -2169.6% | 2263330.8 | 0.6 | -1170.0% |
| 128 x 128 | 2276181.5 | 0.9 | -1609.3% | 2377228.9 | 0.9 | -820.1% |

There is a deep, deep cliff once you hit 65x65.

With this patch:

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 10630.0 | 10.6 | 0.7% | 18112.8 | 6.2 | -0.7% |
| 64 x 64 | 20374.8 | 13.0 | 1.9% | 40487.0 | 6.5 | 0.4% |
| 65 x 65 | 141955.2 | 1.9 | -428.3% | 146708.8 | 1.9 | -179.2% |
| 80 x 80 | 178921.1 | 2.9 | -369.6% | 186032.7 | 2.8 | -156.6% |
| 96 x 96 | 205436.2 | 4.3 | -233.4% | 224513.1 | 3.9 | -97.0% |
| 112 x 112 | 244408.2 | 5.8 | -162.7% | 262158.7 | 5.4 | -47.1% |
| 128 x 128 | 321334.5 | 6.5 | -141.3% | 333829.0 | 6.3 | -29.2% |

The cliff is very significantly reduced. (more to follow)
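For reference, a minimal sketch of the shape of the fix (simplified names, not the verbatim patch): bound the clearing of the jobs bookkeeping by the runtime thread count instead of the compile-time maximum.

```c
/* Sketch, not the verbatim patch: clearing the jobs bookkeeping over the
 * compile-time maximum is O(MAX_CPU_NUMBER^2) work even when only a few
 * threads participate, which dominates small-matrix runs. */
#define MAX_CPU_NUMBER 128            /* compile-time NUM_THREADS */

typedef struct {
  volatile long working[MAX_CPU_NUMBER];
} job_t;

static job_t job[MAX_CPU_NUMBER];

/* Before: loops over the compiled maximum, O(compiled cpus^2) */
void init_jobs_before(void) {
  for (int i = 0; i < MAX_CPU_NUMBER; i++)
    for (int j = 0; j < MAX_CPU_NUMBER; j++)
      job[i].working[j] = 0;
}

/* After: only touch the slots of threads that will actually run */
void init_jobs_after(int nthreads) {
  for (int i = 0; i < nthreads; i++)
    for (int j = 0; j < nthreads; j++)
      job[i].working[j] = 0;
}
```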
The use of _Atomic leads to really bad code generation in the compiler (on x86 you get "mfence" memory barriers around each access with gcc 8, despite x86 being ordered and cache coherent). But there is a fallback in the code that just uses volatile, which is more than plenty in practice. If we are nervous about cross-thread synchronization for these variables, we should make the YIELD function a compiler/memory barrier instead.

Performance before (after the last commit):

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 10630.0 | 10.6 | 0.7% | 18112.8 | 6.2 | -0.7% |
| 64 x 64 | 20374.8 | 13.0 | 1.9% | 40487.0 | 6.5 | 0.4% |
| 65 x 65 | 141955.2 | 1.9 | -428.3% | 146708.8 | 1.9 | -179.2% |
| 80 x 80 | 178921.1 | 2.9 | -369.6% | 186032.7 | 2.8 | -156.6% |
| 96 x 96 | 205436.2 | 4.3 | -233.4% | 224513.1 | 3.9 | -97.0% |
| 112 x 112 | 244408.2 | 5.8 | -162.7% | 262158.7 | 5.4 | -47.1% |
| 128 x 128 | 321334.5 | 6.5 | -141.3% | 333829.0 | 6.3 | -29.2% |

Performance with this patch (roughly a 2x improvement):

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 10756.0 | 10.5 | -0.5% | 18296.7 | 6.1 | -1.7% |
| 64 x 64 | 20490.0 | 12.9 | 1.4% | 40615.0 | 6.5 | 0.0% |
| 65 x 65 | 83528.3 | 3.3 | -210.9% | 96319.0 | 2.9 | -83.3% |
| 80 x 80 | 101453.5 | 5.1 | -166.3% | 128021.7 | 4.0 | -76.6% |
| 96 x 96 | 149795.1 | 5.9 | -143.1% | 168059.4 | 5.3 | -47.4% |
| 112 x 112 | 191481.2 | 7.3 | -105.8% | 204165.0 | 6.9 | -14.6% |
| 128 x 128 | 265019.2 | 7.9 | -99.0% | 272006.4 | 7.7 | -5.3% |
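To illustrate the codegen difference, here is a standalone toy (not OpenBLAS code). With gcc 8 at -O2 on x86-64, a sequentially consistent atomic store compiles to roughly a store followed by an mfence, while the volatile store is a plain mov:

```c
#include <stdatomic.h>

_Atomic long flag_atomic;
volatile long flag_volatile;

/* gcc 8 -O2, x86-64: roughly  movq $1, flag_atomic(%rip); mfence */
void signal_atomic(void) { flag_atomic = 1; }

/* gcc 8 -O2, x86-64: just    movq $1, flag_volatile(%rip)
 * x86 stores are already ordered and caches are coherent, so for a
 * producer/consumer flag the fence adds cost without adding safety
 * here (assuming the spin/yield side has its own compiler barrier). */
void signal_volatile(void) { flag_volatile = 1; }
```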
param.h defines a per-platform SWITCH_RATIO, which is used as a measure of how fine-grained the blocks for gemm need to be split up. Many platforms define it to 4. The reality is that the low-level gemm implementation for SkylakeX likes bigger blocks due to the nature of SIMD; tuning SWITCH_RATIO to 32 improves the threading performance significantly:

Before:

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 10756.0 | 10.5 | -0.5% | 18296.7 | 6.1 | -1.7% |
| 64 x 64 | 20490.0 | 12.9 | 1.4% | 40615.0 | 6.5 | 0.0% |
| 65 x 65 | 83528.3 | 3.3 | -210.9% | 96319.0 | 2.9 | -83.3% |
| 80 x 80 | 101453.5 | 5.1 | -166.3% | 128021.7 | 4.0 | -76.6% |
| 96 x 96 | 149795.1 | 5.9 | -143.1% | 168059.4 | 5.3 | -47.4% |
| 112 x 112 | 191481.2 | 7.3 | -105.8% | 204165.0 | 6.9 | -14.6% |
| 128 x 128 | 265019.2 | 7.9 | -99.0% | 272006.4 | 7.7 | -5.3% |

After:

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 10666.3 | 10.6 | 0.4% | 18236.9 | 6.2 | -1.4% |
| 64 x 64 | 20410.1 | 13.0 | 1.8% | 39925.8 | 6.6 | 1.7% |
| 65 x 65 | 34983.0 | 7.9 | -30.2% | 51494.6 | 5.4 | 2.0% |
| 80 x 80 | 39769.1 | 13.0 | -4.4% | 63805.2 | 8.1 | 12.0% |
| 96 x 96 | 45169.6 | 19.7 | 26.7% | 80065.8 | 11.1 | 29.8% |
| 112 x 112 | 57026.1 | 24.7 | 38.7% | 99535.5 | 14.2 | 44.1% |
| 128 x 128 | 64789.8 | 32.5 | 51.3% | 117407.2 | 17.9 | 54.6% |

With this change, threading starts to be a win already at 96x96.
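For context, a hedged sketch of the role SWITCH_RATIO plays (simplified from the real threading driver; `slice_width` is a made-up helper): it sets the granularity in which the work is divided among threads, so a larger value keeps each per-thread block big enough to feed the wide SIMD kernels.

```c
/* Sketch (slice_width is illustrative, not the real driver code):
 * divide n columns among nthreads, rounding each slice up to a multiple
 * of SWITCH_RATIO so per-thread blocks stay large enough for the SIMD
 * kernels to run at full width. */
#define SWITCH_RATIO 32   /* was 4 in param.h for this platform */

static int slice_width(int n, int nthreads) {
  int width = (n + nthreads - 1) / nthreads;            /* even split */
  width = ((width + SWITCH_RATIO - 1) / SWITCH_RATIO)   /* round up to a */
          * SWITCH_RATIO;                               /* multiple      */
  return width;
}
```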
Similar to the SKYLAKEX patch, 32 seems to work best (much better than 4 or 16).

Before (SWITCH_RATIO = 4):

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 15554.3 | 7.2 | 0.2% | 30353.8 | 3.7 | 0.3% |
| 64 x 64 | 30346.8 | 8.7 | 1.6% | 63495.0 | 4.1 | -0.1% |
| 65 x 65 | 81668.1 | 3.4 | -123.3% | 82705.2 | 3.3 | -21.2% |
| 80 x 80 | 105045.9 | 4.9 | -95.5% | 115226.0 | 4.5 | -2.2% |
| 96 x 96 | 152461.2 | 5.8 | -74.3% | 148156.3 | 6.0 | 16.4% |
| 112 x 112 | 188505.2 | 7.5 | -42.2% | 171187.3 | 8.2 | 36.4% |
| 128 x 128 | 257884.0 | 8.1 | -39.5% | 224764.8 | 9.3 | 46.0% |

Intermediate (16):

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 15565.7 | 7.2 | 0.2% | 30378.9 | 3.7 | 0.2% |
| 64 x 64 | 30430.2 | 8.7 | 1.3% | 63046.4 | 4.2 | 0.6% |
| 65 x 65 | 27306.0 | 10.1 | 25.3% | 38879.2 | 7.1 | 43.0% |
| 80 x 80 | 51008.7 | 10.1 | 5.1% | 61007.6 | 8.4 | 45.9% |
| 96 x 96 | 70856.7 | 12.5 | 19.0% | 83403.1 | 10.6 | 53.0% |
| 112 x 112 | 84769.9 | 16.6 | 36.0% | 99920.1 | 14.1 | 62.9% |
| 128 x 128 | 84213.2 | 25.0 | 54.5% | 113024.2 | 18.6 | 72.8% |

After (32):

| Matrix | SGEMM cycles | SGEMM MPC | SGEMM Δ | DGEMM cycles | DGEMM MPC | DGEMM Δ |
|---|---|---|---|---|---|---|
| 48 x 48 | 15537.3 | 7.2 | 0.3% | 30537.0 | 3.6 | -0.3% |
| 64 x 64 | 30352.7 | 8.7 | 1.6% | 62597.8 | 4.2 | 1.3% |
| 65 x 65 | 36857.0 | 7.5 | -0.8% | 56167.6 | 4.9 | 17.7% |
| 80 x 80 | 42552.6 | 12.1 | 20.8% | 69536.7 | 7.4 | 38.3% |
| 96 x 96 | 52101.5 | 17.1 | 40.5% | 91016.1 | 9.7 | 48.7% |
| 112 x 112 | 63853.7 | 22.1 | 51.8% | 110507.4 | 12.7 | 58.9% |
| 128 x 128 | 73966.1 | 28.4 | 60.0% | 163146.4 | 12.9 | 60.8% |
A few places in the gemm scheduler code were missing barriers; the code likely worked OK due to the heavy use of volatile/_Atomic, but there is no reason to leave this incorrect.
While on x86(-64) one does not normally need full memory barriers, it is good practice to at least use compiler barriers in the places where other architectures use memory barriers; this prevents the compiler from over-optimizing.
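As a sketch of what that strengthening can look like on x86-64 (illustrative, not the exact patch): the barrier macros go from no-ops to compiler barriers, which emit no machine instruction but stop the compiler from reordering or caching accesses across them.

```c
/* Illustrative, not the exact patch: on x86-64 a compiler barrier is
 * enough because the hardware is strongly ordered; it emits no machine
 * instruction but prevents the compiler from reordering or caching
 * memory accesses across it. Weakly ordered architectures keep real
 * hardware barriers here. */
#if defined(__x86_64__) || defined(__i386__)
#define MB   __asm__ __volatile__ ("" ::: "memory")   /* was a no-op */
#define WMB  __asm__ __volatile__ ("" ::: "memory")   /* was a no-op */
#else
#define MB   __sync_synchronize()
#define WMB  __sync_synchronize()
#endif
```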
Updated with barrier strengthening.
Make it so that `if (foo) RMB; else MB;` is always done correctly and without syntax surprises.
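The standard way to get that property is to wrap each statement-like macro in `do { } while (0)`; a sketch with illustrative barrier bodies:

```c
/* A macro that expands to a braced block (or to nothing) breaks
 *     if (foo) RMB; else MB;
 * because the trailing `;` ends the if statement early.  Wrapping the
 * body in do { } while (0) makes the macro a single statement that
 * still requires its trailing semicolon.  Barrier bodies here are
 * illustrative, not the project's actual definitions. */
#define RMB do { __asm__ __volatile__ ("" ::: "memory"); } while (0)
#define MB  do { __asm__ __volatile__ ("mfence" ::: "memory"); } while (0)

void example(int foo) {
  if (foo) RMB; else MB;   /* parses exactly as it reads */
}
```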
Interestingly enough, TSAN claims there are still data races here, though they go away if I use _Atomic instead of volatile for the working array. I'll raise a new issue.
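As a toy illustration of why (not OpenBLAS code): volatile only constrains the compiler; it does not make the accesses atomic, so TSAN reports the volatile pair below as a race while the _Atomic pair is clean.

```c
/* Build: gcc -fsanitize=thread -O1 tsan_demo.c -lpthread */
#include <pthread.h>
#include <stdatomic.h>

volatile int v_flag;   /* TSAN reports this write/read pair as a data race */
_Atomic int  a_flag;   /* accesses are atomic; TSAN stays quiet */

static void *writer(void *arg) {
  (void)arg;
  v_flag = 1;
  a_flag = 1;
  return NULL;
}

int main(void) {
  pthread_t t;
  pthread_create(&t, NULL, writer, NULL);
  int v = v_flag;   /* racy read of the volatile flag */
  int a = a_flag;   /* atomic read; no report */
  pthread_join(t, NULL);
  (void)v; (void)a;
  return 0;
}
```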