Threading threshold tuning needed for sgemm/dgemm #1622
(I've run the same data replacing the sched_yield with pause but no remarkable difference) |
I have root caused this issue and have a slightly hacky solution. testing now and pondering cleanups |
The cutoff point is still at 65 for now (needs to move a little), but the "threading hump" of cost
|
The Julia folks keep telling us that GEMM_MULTITHREAD_THRESHOLD should be set at 50 rather than the current value of 4 that wernsaar kept defending. Could still be that the real issue lies elsewhere; some assumptions in the code may have been valid when Goto did his groundbreaking work but less so on current hardware. |
I've run similar experiments lately, and yes, somewhere from 30-60 gives the best performance overall, at least for non-complex matrices (I haven't tried those). Note that there's a cube-root relationship with the side length for determining whether to use threads. I can post the tests I've run tomorrow, when I'm back in the office. |
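The cube-root relationship mentioned above can be sketched roughly like this (a minimal illustration, not the actual OpenBLAS decision code; the function name `should_use_threads` is hypothetical, and the threshold value 50 is the one proposed in this thread):

```c
#include <assert.h>

/* Hypothetical sketch: the threading decision for an m x n x k GEMM
 * compares the total multiply count against the cube of a side-length
 * threshold, so a single scalar threshold governs all shapes. */
#define GEMM_MULTITHREAD_THRESHOLD 50

static int should_use_threads(long m, long n, long k)
{
    long t = GEMM_MULTITHREAD_THRESHOLD;
    /* threading pays off once m*n*k exceeds threshold^3 */
    return m * n * k > t * t * t;
}
```

With this shape of check, a 40x40x40 multiply stays single-threaded while a 200x200x200 one goes parallel, which is why tuning the single threshold constant moves the cutoff for every matrix shape at once.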
At least part of the issue is NOT that the threshold is wrong (it may well be, but it seems at least in the ballpark now) but that the base cost of starting threading is too high and the threading is inefficient due to a set of "performance bugs", which makes the tradeoff not work right in practice (as per data above). I have hacks for the perf bugs I found so far, and the base cost is now an order of magnitude lower.
|
next steps are convincing myself that the hacks to fix this are valid fixes and will not break stuff |
(note that an 80x80 sgemm went from 156k cycles to 40k cycles with the fixes, so I think it's worth my time to keep poking) |
Sounds intriguing, but there is too little detail here for an intelligent comment. Though with level3 BLAS there is the known issue that someone thought it necessary to leave a "USE_SIMPLE_THREADED_LEVEL3" option available, and there may also be some unfinished work in wernsaar's tree from shortly before he dropped out of the project (for whatever reason, hopefully not medical) |
I had a theory about bits moving between outer caches if the task does not fit. |
has the basic code to win the performance back |
That use of _Atomic comes from #660 (comment) - I did not know any better. The other issues you found seem to be specific to building for "your" SkylakeX with support for a larger-than-necessary number of threads, so should not be able to cause regressions - while providing a nice boost to distribution-built binaries. |
If you're ok with the general direction I'll be tuning Haswell as well of course ;-) I can poke at the _Atomic and instead put proper barriers in the places where it matters; there are not THAT many. |
If you promise not to cause any time warps ? ;-) |
Causality shall be strictly preserved |
Initialize only the required subset of the jobs array, fix barriers and improve switch ratio on SkylakeX and Haswell. For issue #1622
I ran some experiments comparing single-threaded and multi-threaded builds, and there is no penalty for building with threads, and switching to multiple threads does not create a penalty either. Nice work @fenrus75 and @oon3m0oo; should this issue be closed now?
|
_Atomic might shut the compiler tool up, but it does not close any race conditions at all. If there are race conditions in the code, they are there with or without _Atomic. _Atomic is not magic in that sense; it's not a substitute for locking at all.
…On Thu, Jun 21, 2018 at 7:53 AM Martin Kroeker wrote:
> #1634 is about to bring _Atomic back, which according to d148ec4 would have an impact on performance...
|
To be specific: to make things synchronization-safe, one generally needs to do the equivalent of an "atomic exchange"; _Atomic does not do that. It just makes "set" more expensive...
In the OpenBLAS model, the data is just for "am I done", and that does not need to be atomic; it needs proper barriers (a WMB after writing, and an MB before reading, or at least an MB in the loop that polls during read, so before the second read).
I audited the file and added the (W)MBs in the places where they were missing.
Now, on non-GCC x86-64, MB/WMB might not be the correct primitives (e.g. on x86 they only need to be compiler barriers, since the architecture is observably ordered natively), but that should be easy to fix for an LLVM/etc person who knows how their compiler does these.
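A minimal sketch of the "am I done" flag pattern described above (this is illustrative, not the actual OpenBLAS source; the variable and function names are hypothetical). It uses GCC-style inline-asm compiler barriers, which on x86 are sufficient since the hardware keeps stores observably ordered; other architectures would need real fence instructions behind these macros:

```c
#include <assert.h>

/* Compiler-only barriers: enough on x86, NOT enough on weakly
 * ordered architectures (those need real fences here). */
#define WMB __asm__ __volatile__("" ::: "memory") /* write barrier */
#define MB  __asm__ __volatile__("" ::: "memory") /* full barrier  */

static volatile int result; /* the data being published        */
static volatile int done;   /* the "am I done" completion flag */

static void worker(void)
{
    result = 42; /* publish the data first...           */
    WMB;         /* ...make the data store visible...   */
    done = 1;    /* ...before flipping the "done" flag  */
}

static int wait_for_result(void)
{
    while (!done) /* poll until the worker has finished       */
        MB;       /* force a fresh read on every iteration    */
    MB;           /* order the flag read before the data read */
    return result;
}
```

The point of the pattern: the flag itself never needs to be atomic, because only the ordering between the data store and the flag store (and symmetrically between the flag read and the data read) matters.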
|
Makes me wonder if these were just false positives from TSAN? |
I've reverted that part of the change until we figure this out. I'm also wondering if an explicit synchronization mechanism would be faster than the spin waits. Was that ever tried? For fun I tried using condition variables and it was horribly slow, but there are other options. |
I think we need to start by making the objective more explicit; it's hard to reason about synchronization without clearly knowing the expectations. So maybe something like this:
There is a (sometimes large) array of work items that need to be done (e.g. tiles in the matrix multiply). In the threaded environment, coordination between threads is needed for
1) which work blocks have had work started on them by a thread, to avoid multiple threads starting work on the same block
2) which work blocks have completed, to avoid redoing work that's already done
3) data start dependencies; some blocks cannot be started before other work blocks are completed
4) data end dependencies; some blocks cannot complete their work before other blocks are completed
5) total job completion; the overall job is not completed (and the main caller cannot continue) until all work blocks are completed
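Points 1, 2 and 5 above can be sketched as follows (an illustration under my own assumptions, not the OpenBLAS job-array layout; all names here are hypothetical). Claiming a block uses an atomic exchange, which is the "atomic exchange as a general rule" mentioned earlier: only the thread that observes the old value 0 has really claimed the block:

```c
#include <assert.h>

#define NBLOCKS 8

static int claimed[NBLOCKS];  /* 0 = free,    1 = taken by a thread */
static int finished[NBLOCKS]; /* 0 = pending, 1 = done              */

/* Returns 1 if this caller won the block, 0 if another thread
 * already took it.  __sync_lock_test_and_set is a GCC builtin
 * atomic exchange. */
static int claim_block(int i)
{
    return __sync_lock_test_and_set(&claimed[i], 1) == 0;
}

/* Point 5: the overall job is complete only when every block is. */
static int all_done(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        if (!finished[i])
            return 0;
    return 1;
}
```

In a real implementation the `finished` reads and writes would need the WMB/MB barrier discipline discussed earlier; the dependency points 3 and 4 would add per-block predecessor checks before `claim_block` is even attempted.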
…On Thu, Jun 21, 2018 at 8:56 AM Martin Kroeker wrote:
> Nobody willing or able to try, as far as I remember. Mentioned in #731 (which got closed rather quickly after fixing a side topic) and I think #923 (guess who started it :-) and it may actually be resolved by your combined work now)
|
I think those are exactly how the code is structured now, and having tried both eventfds and futexes in an attempt to make it fast, I have to conclude that spin-waiting is indeed the best thing to do. I have, however, put up a PR (#1642) to make threading tools happy, if that's acceptable. |
Based on the data below, the threading threshold tuning might need some help; if you happen to have a good pointer, I'll play with it and help.
The baseline for performance is this:
Cycles-per-multiply is likely the most useful performance metric; for Float the hardware in question has a theoretical limit of 0.03125 CPM, and we get close to half of theoretical as matrices get bigger.
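For reference, the cycles-per-multiply metric is just measured cycles divided by the n^3 multiplies an n x n GEMM performs (a sketch; the function name is mine, and the interpretation of the 0.03125 limit as 32 single-precision multiplies per cycle is my assumption about the hardware, e.g. two 16-wide FMA units):

```c
#include <assert.h>

/* CPM for a square n x n single-precision GEMM, which performs
 * n*n*n multiply(-add) operations. */
static double cycles_per_multiply(double cycles, long n)
{
    return cycles / ((double)n * n * n);
}
```

Plugging in the 80x80 sgemm numbers quoted earlier in the thread, 40k cycles works out to about 0.078 CPM, i.e. within a factor of ~2.5 of the 0.03125 theoretical limit.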
Once threading gets enabled (20 logical cpus, 10 physical cores on the system) things get interesting:
(the percentages are the performance delta versus baseline; negative numbers are a performance loss)
For very small matrices there is a little bit of overhead, but thanks to @oon3m0oo and @sandwichmaker this overhead is pretty tiny.
HOWEVER, once threading kicks in (just after 64x64), performance tanks compared to the baseline and does not recover until around 200x200.
This is not related to OpenMP-versus-threads, with OpenMP the data looks like this:
which shows OpenMP has a bit more baseline overhead, but otherwise the same problem from 64x64 to 200x200.
So my conclusion is that threading currently kicks in at matrices that are too small; if it started at 200x200, there would be a win across the board.