Skip to content

pthread locking oddities as exemplified by test/sblat2 #995

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
martin-frbg opened this issue Oct 25, 2016 · 3 comments
Closed

pthread locking oddities as exemplified by test/sblat2 #995

martin-frbg opened this issue Oct 25, 2016 · 3 comments

Comments

@martin-frbg
Copy link
Collaborator

In the discussion of #990, @juliantaylor mentioned seeing infrequent lockups with pthreads that were exacerbated by attempts to speed up or remove YIELDING. One simple test case he mentioned was the sblat2 test, so I tried to run the "helgrind" thread debugger from the valgrind tool suite (valgrind.org, pertinent documentation at http://valgrind.org/docs/manual/hg-manual.html ) on a DEBUG=1 build of the current develop branch with test/sblat2 <./sblat2.dat . (Hardware involved is a lowly Atom but this should not matter at least for the init phase, I can rerun with better hardware if necessary.)

helgrind log is here: sblat2.txt

The initial lock acquisition order error it records looks to be a gfortran (4.8.5) glitch, things get interesting after the Thread-Announcement heading.
At first glance, the subsequent messages suggest that in driver/others/memory.c the "LOCK_COMMAND" in line 969 (and UNLOCK in 1003) should be moved outside the scope of the "if..." that immediately precedes it (in order to guard reads as well as writes to memory_initialized).
Also in that file, the alloc_mmap() function for the NO_WARMUP case (line 395 onwards) seems to need locks around the assignments to release_info and in particular the incrementing of release_pos.

Comments from people more thread-aware than myself greatly appreciated...

@martin-frbg
Copy link
Collaborator Author

Just as a quick headsup, I succeeded in making the helgrind warnings in memory.c go away by adding locks around the release_info[] changes in alloc_mmap() and the accesses to memory[position] in blas_memory_alloc()/blas_memory_free(). (Some of the spots in blas_memory_alloc were already guarded by locks conditional on the value of memory[position].used, which seems to make little sense as that itself is subject to races.)
I ran out of time and ideas trying to fix blas_server.c, best I could achieve there is getting two threads each holding a different lock pull the rug under each others feet. Hope to revisit this over the holidays and ideally create a PR for perusal then.

@martin-frbg
Copy link
Collaborator Author

Updated helgrind log (now using helgrind from current valgrind 3.12.0 on Skylake):
helgrind.txt
and after adding and shuffling locks in blas_server.c and memory.c
helgrind-after_changes.txt
Most of what remains appears to be the fault of gfortran as mentioned previously; I am not sure how to resolve the one potential race involving two different locks but as it is in thread shutdown it may be harmless. So far I have only checked that all tests still pass.

@martin-frbg
Copy link
Collaborator Author

martin-frbg commented Jan 4, 2017

Changed files are currently stored as a pull request within my own fork, comments welcome
995.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant