Threading threshold tuning needed for sgemm/dgemm #1622
(I've run the same data replacing the sched_yield with pause but no remarkable difference) |
I have root caused this issue and have a slightly hacky solution. testing now and pondering cleanups |
The cutoff point is still at 65 for now (needs to move a little), but the "threading hump" of cost
|
The Julia folks keep telling us that GEMM_MULTITHREAD_THRESHOLD should be set at 50 rather than the current value of 4 that wernsaar kept defending. Could still be that the real issue lies elsewhere; some assumptions in the code may have been valid when Goto did his groundbreaking work but less so on current hardware. |
I've run similar experiments lately, and yes, somewhere from 30-60 gives the best performance overall, at least for non-complex matrices (I haven't tried those). Note that there's a cube-root relationship with the side length for determining whether to use threads. I can post the tests I've run tomorrow, when I'm back in the office. |
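The cube-root relationship mentioned above can be sketched roughly like this (a minimal illustration, not the actual OpenBLAS decision code; the function name `should_use_threads` is hypothetical, and the threshold value 50 is the one proposed in this thread):

```c
#include <assert.h>

/* Hypothetical sketch: the threading decision for an m x n x k GEMM
 * compares the total multiply count against the cube of a side-length
 * threshold, so a single scalar threshold governs all shapes. */
#define GEMM_MULTITHREAD_THRESHOLD 50

static int should_use_threads(long m, long n, long k)
{
    long t = GEMM_MULTITHREAD_THRESHOLD;
    /* threading pays off once m*n*k exceeds threshold^3 */
    return m * n * k > t * t * t;
}
```

With this shape of check, a 40x40x40 multiply stays single-threaded while a 200x200x200 one goes parallel, which is why tuning the single threshold constant moves the cutoff for every matrix shape at once.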
At least part of the issue is NOT that the threshold is wrong (it may well be, but it seems at least in the ballpark now) but that the base cost of starting threading is too high and the threading is inefficient due to a set of "performance bugs", which makes the tradeoff not work right in practice (as per data above). I have hacks for the perf bugs I found so far, and the base cost is now an order of magnitude lower.
|
next steps are convincing myself that the hacks to fix this are valid fixes and will not break stuff |
(note that an 80x80 sgemm went from 156k cycles to 40k cycles with the fixes, so I think it's worth my time to keep poking) |
Sounds intriguing, but there is too little detail here for an intelligent comment. Though with level3 BLAS there is the known issue that someone thought it necessary to leave a "USE_SIMPLE_THREADED_LEVEL3" option available, and there may also be some unfinished work in wernsaar's tree from shortly before he dropped out of the project (for whatever reason, hopefully not medical) |
I had a theory about bits moving between outer caches if the task does not fit. |
has the basic code to win the performance back |
That use of _Atomic comes from #660 (comment) - I did not know any better. The other issues you found seem to be specific to building for "your" SkylakeX with support for a larger-than-necessary number of threads, so should not be able to cause regressions - while providing a nice boost to distribution-built binaries. |
If you're ok with the general direction I'll be tuning Haswell as well of course ;-) I can poke at the _Atomic and instead put proper barriers in the places where it matters; there are not THAT many. |
If you promise not to cause any time warps ? ;-) |
Causality shall be strictly preserved |
Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622
I ran some experiments comparing single-threaded and multi-threaded builds, and there is no penalty for building with threads, and switching to multiple threads does not create a penalty either. Nice work @fenrus75 and @oon3m0oo; should this issue be closed now?
|
_Atomic might shut the compiler tool up, but it does not close any race conditions at all. If there are race conditions in the code, they are there with or without _Atomic. _Atomic is not magic in that sense; it's not a substitute for locking at all.
…On Thu, Jun 21, 2018 at 7:53 AM Martin Kroeker wrote:
> #1634 is about to bring _Atomic back, which according to d148ec4 would have an impact on performance...
|
To be specific: to make things synchronization-safe, one generally needs to do the equivalent of an "atomic exchange"; _Atomic does not do that. It just makes "set" more expensive...
In the OpenBLAS model, the data is just for "am I done", and that does not need to be atomic; it needs proper barriers (a WMB after writing, and an MB before reading, or at least an MB in the loop that polls during read, so before the second read).
I audited the file and added the (W)MBs in the places where they were missing.
Now, on non-GCC x86-64, MB/WMB might not be the correct primitives (e.g. on x86 they only need to be compiler barriers, since the architecture is observably ordered natively), but that should be easy to fix for an LLVM/etc person who knows how their compiler does these.
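A minimal sketch of the "am I done" flag pattern described above (this is illustrative, not the actual OpenBLAS source; the variable and function names are hypothetical). It uses GCC-style inline-asm compiler barriers, which on x86 are sufficient since the hardware keeps stores observably ordered; other architectures would need real fence instructions behind these macros:

```c
#include <assert.h>

/* Compiler-only barriers: enough on x86, NOT enough on weakly
 * ordered architectures (those need real fences here). */
#define WMB __asm__ __volatile__("" ::: "memory") /* write barrier */
#define MB  __asm__ __volatile__("" ::: "memory") /* full barrier  */

static volatile int result; /* the data being published        */
static volatile int done;   /* the "am I done" completion flag */

static void worker(void)
{
    result = 42; /* publish the data first...           */
    WMB;         /* ...make the data store visible...   */
    done = 1;    /* ...before flipping the "done" flag  */
}

static int wait_for_result(void)
{
    while (!done) /* poll until the worker has finished       */
        MB;       /* force a fresh read on every iteration    */
    MB;           /* order the flag read before the data read */
    return result;
}
```

The point of the pattern: the flag itself never needs to be atomic, because only the ordering between the data store and the flag store (and symmetrically between the flag read and the data read) matters.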
|
Makes me wonder if these were just false positives from TSAN? |
I've reverted that part of the change until we figure this out. I'm also wondering if an explicit synchronization mechanism would be faster than the spin waits. Was that ever tried? For fun I tried using condition variables and it was horribly slow, but there are other options. |
I think we need to start by making the objective more explicit; it's hard to reason about synchronization without clearly knowing the expectations. So maybe something like this:
There is a (sometimes large) array of work items that need to be done (e.g. tiles in the matrix multiply). In the threaded environment, coordination between threads is needed for
1) which work blocks have had work started on them by a thread, to avoid multiple threads starting work on the same block
2) which work blocks have completed, to avoid redoing work that's already done
3) data start dependencies; some blocks cannot be started before other work blocks are completed
4) data end dependencies; some blocks cannot complete their work before other blocks are completed
5) total job completion; the overall job is not completed (and the main caller cannot continue) until all work blocks are completed
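Points 1, 2 and 5 above can be sketched as follows (an illustration under my own assumptions, not the OpenBLAS job-array layout; all names here are hypothetical). Claiming a block uses an atomic exchange, which is the "atomic exchange as a general rule" mentioned earlier: only the thread that observes the old value 0 has really claimed the block:

```c
#include <assert.h>

#define NBLOCKS 8

static int claimed[NBLOCKS];  /* 0 = free,    1 = taken by a thread */
static int finished[NBLOCKS]; /* 0 = pending, 1 = done              */

/* Returns 1 if this caller won the block, 0 if another thread
 * already took it.  __sync_lock_test_and_set is a GCC builtin
 * atomic exchange. */
static int claim_block(int i)
{
    return __sync_lock_test_and_set(&claimed[i], 1) == 0;
}

/* Point 5: the overall job is complete only when every block is. */
static int all_done(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        if (!finished[i])
            return 0;
    return 1;
}
```

In a real implementation the `finished` reads and writes would need the WMB/MB barrier discipline discussed earlier; the dependency points 3 and 4 would add per-block predecessor checks before `claim_block` is even attempted.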
…On Thu, Jun 21, 2018 at 8:56 AM Martin Kroeker wrote:
> Nobody willing or able to try, as far as I remember. Mentioned in #731 (which got closed rather quickly after fixing a side topic) and I think #923 (guess who started it :-) and it may actually be resolved by your combined work now)
|
I think those are exactly how the code is structured now, and having tried both eventfds and futexes in an attempt to make it fast, I have to conclude that spin-waiting is indeed the best thing to do. I have, however, put up a PR (#1642) to make threading tools happy, if that's acceptable. |
Based on the data below, the threading threshold tuning might need some help; if you happen to have a good pointer, I'll play with it and help.
The baseline for performance is this:
Cycles-per-multiply is likely the most useful performance metric; for Float the hardware in question has a theoretical limit of 0.03125 CPM, and we get close to half of theoretical as matrices get bigger.
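For reference, the cycles-per-multiply metric is just measured cycles divided by the n^3 multiplies an n x n GEMM performs (a sketch; the function name is mine, and the interpretation of the 0.03125 limit as 32 single-precision multiplies per cycle is my assumption about the hardware, e.g. two 16-wide FMA units):

```c
#include <assert.h>

/* CPM for a square n x n single-precision GEMM, which performs
 * n*n*n multiply(-add) operations. */
static double cycles_per_multiply(double cycles, long n)
{
    return cycles / ((double)n * n * n);
}
```

Plugging in the 80x80 sgemm numbers quoted earlier in the thread, 40k cycles works out to about 0.078 CPM, i.e. within a factor of ~2.5 of the 0.03125 theoretical limit.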
Once threading gets enabled (20 logical cpus, 10 physical cores on the system) things get interesting:
(the percentages are the performance delta versus baseline; negative numbers are a performance loss)
For very small matrices there is a little bit of overhead, but thanks to @oon3m0oo and @sandwichmaker this overhead is pretty tiny.
HOWEVER, once threading kicks in (just after 64x64), performance tanks compared to the baseline and does not recover until around 200x200.
This is not related to OpenMP-versus-threads, with OpenMP the data looks like this:
which shows OpenMP has a bit more baseline overhead, but otherwise the same problem from 64x64 to 200x200.
So my conclusion is that threading currently kicks in at matrices that are too small; if it started at 200x200, there would be a win across the board.