pthread locking oddities as exemplified by test/sblat2 #995

martin-frbg · 2016-10-25T16:28:49Z

In the discussion of #990, @juliantaylor mentioned seeing infrequent lockups with pthreads that were exacerbated by attempts to speed up or remove YIELDING. One simple test case he mentioned was the sblat2 test, so I tried to run the "helgrind" thread debugger from the valgrind tool suite (valgrind.org, pertinent documentation at http://valgrind.org/docs/manual/hg-manual.html ) on a DEBUG=1 build of the current develop branch with test/sblat2 <./sblat2.dat . (Hardware involved is a lowly Atom but this should not matter at least for the init phase, I can rerun with better hardware if necessary.)

helgrind log is here: sblat2.txt

The initial lock acquisition order error it records looks to be a gfortran (4.8.5) glitch, things get interesting after the Thread-Announcement heading.
At first glance, the subsequent messages suggest that in driver/others/memory.c the "LOCK_COMMAND" in line 969 (and UNLOCK in 1003) should be moved outside the scope of the "if..." that immediately precedes it (in order to guard reads as well as writes to memory_initialized).
Also in that file, the alloc_mmap() function for the NO_WARMUP case (line 395 onwards) seems to need locks around the assignments to release_info and in particular the incrementing of release_pos.

Comments from people more thread-aware than myself greatly appreciated...

martin-frbg · 2016-12-22T14:25:48Z

Just as a quick headsup, I succeeded in making the helgrind warnings in memory.c go away by adding locks around the release_info[] changes in alloc_mmap() and the accesses to memory[position] in blas_memory_alloc()/blas_memory_free(). (Some of the spots in blas_memory_alloc were already guarded by locks conditional on the value of memory[position].used, which seems to make little sense as that itself is subject to races.)
I ran out of time and ideas trying to fix blas_server.c, best I could achieve there is getting two threads each holding a different lock pull the rug under each others feet. Hope to revisit this over the holidays and ideally create a PR for perusal then.

martin-frbg · 2017-01-04T15:52:51Z

Updated helgrind log (now using helgrind from current valgrind 3.12.0 on Skylake):
helgrind.txt
and after adding and shuffling locks in blas_server.c and memory.c
helgrind-after_changes.txt
Most of what remains appears to be the fault of gfortran as mentioned previously; I am not sure how to resolve the one potential race involving two different locks but as it is in thread shutdown it may be harmless. So far I have only checked that all tests still pass.

martin-frbg · 2017-01-04T17:53:11Z

Changed files ~~are currently stored as a pull request within my own fork~~, comments welcome
995.zip

martin-frbg mentioned this issue Oct 27, 2016

Deadlock in threaded dgemv #660

Closed

martin-frbg mentioned this issue Jan 8, 2017

Fix thread data races detected by helgrind 3.12 #1052

Merged

martin-frbg closed this as completed Jan 9, 2017

martin-frbg mentioned this issue Jan 23, 2017

Race Condition in Multithreaded OpenBLAS on IBM OpenPower 8 #1071

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pthread locking oddities as exemplified by test/sblat2 #995

pthread locking oddities as exemplified by test/sblat2 #995

martin-frbg commented Oct 25, 2016

martin-frbg commented Dec 22, 2016

Uh oh!

martin-frbg commented Jan 4, 2017

Uh oh!

martin-frbg commented Jan 4, 2017 •

edited

Loading

Uh oh!

pthread locking oddities as exemplified by test/sblat2 #995

pthread locking oddities as exemplified by test/sblat2 #995

Comments

martin-frbg commented Oct 25, 2016

martin-frbg commented Dec 22, 2016

Uh oh!

martin-frbg commented Jan 4, 2017

Uh oh!

martin-frbg commented Jan 4, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-frbg commented Jan 4, 2017 •

edited

Loading