Data race when reading / writing join handle waker #2087

Closed
tmiasko opened this issue Jan 10, 2020 · 11 comments · Fixed by #2096

@tmiasko
Contributor

tmiasko commented Jan 10, 2020

WARNING: ThreadSanitizer: data race (pid=70621)
  Read of size 8 at 0x7b4c00010100 by thread T9:
    #0 memcpy /home/tm/test/rust/src/llvm-project/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:801:5 (fs_dir-663770699b1ccc75+0x52626)
    #1 memcpy /home/tm/test/rust/src/llvm-project/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:793:1 (fs_dir-663770699b1ccc75+0x52626)
    #2 core::intrinsics::copy_nonoverlapping::h11aaea6dc78c791a /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/intrinsics.rs:1523:4 (fs_dir-663770699b1ccc75+0x3cfb87)
    #3 core::ptr::read::h63b05f9b903d79d1 /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/ptr/mod.rs:627:4 (fs_dir-663770699b1ccc75+0x384263)
    #4 core::ptr::const_ptr::_$LT$impl$u20$$BP$const$u20$T$GT$::read::hd1c0cc84b20ec616 /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/ptr/const_ptr.rs:585:8 (fs_dir-663770699b1ccc75+0x3ce74d)
    #5 tokio::task::harness::Harness$LT$T$C$S$GT$::read_join_waker::_$u7b$$u7b$closure$u7d$$u7d$::hf2823eccbaf339f0 /home/tm/tokio/tokio/src/task/harness.rs:548:49 (fs_dir-663770699b1ccc75+0x32578d)
    #6 tokio::loom::std::causal_cell::CausalCell$LT$T$GT$::with_deferred::hb0c5f4e5331d78d0 /home/tm/tokio/tokio/src/loom/std/causal_cell.rs:34:9 (fs_dir-663770699b1ccc75+0x316dba)
    #7 tokio::task::harness::Harness$LT$T$C$S$GT$::read_join_waker::h8ca44083f528b659 /home/tm/tokio/tokio/src/task/harness.rs:548:8 (fs_dir-663770699b1ccc75+0x3255b6)
    #8 tokio::task::harness::Harness$LT$T$C$S$GT$::transition_to_released::h4e23461a410c2fa2 /home/tm/tokio/tokio/src/task/harness.rs:473:47 (fs_dir-663770699b1ccc75+0x324203)
    #9 tokio::task::harness::Harness$LT$T$C$S$GT$::complete::hce95ce7f0d9fa8c0 /home/tm/tokio/tokio/src/task/harness.rs:424:19 (fs_dir-663770699b1ccc75+0x3238c8)
    #10 tokio::task::harness::Harness$LT$T$C$S$GT$::poll::hc812e89064cec4b4 /home/tm/tokio/tokio/src/task/harness.rs:131:16 (fs_dir-663770699b1ccc75+0x31c282)
    #11 tokio::task::raw::poll::hb64457ce8ac61483 /home/tm/tokio/tokio/src/task/raw.rs:162:4 (fs_dir-663770699b1ccc75+0x3edd94)
    #12 tokio::task::raw::RawTask::poll::h226144d0c703dae2 /home/tm/tokio/tokio/src/task/raw.rs:113:8 (fs_dir-663770699b1ccc75+0x35a3ab)
    #13 tokio::task::Task$LT$S$GT$::run::h237d0766d9400fdd /home/tm/tokio/tokio/src/task/mod.rs:371:16 (fs_dir-663770699b1ccc75+0x35e540)
    ...
    #33 std::sys::unix::thread::Thread::new::thread_start::h01caf79998f1ad89 /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libstd/sys/unix/thread.rs:80:16 (fs_dir-663770699b1ccc75+0x4539c9)

  Previous write of size 8 at 0x7b4c00010100 by thread T4:
    #0 memcpy /home/tm/test/rust/src/llvm-project/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:801:5 (fs_dir-663770699b1ccc75+0x52626)
    #1 memcpy /home/tm/test/rust/src/llvm-project/compiler-rt/lib/tsan/../sanitizer_common/sanitizer_common_interceptors.inc:793:1 (fs_dir-663770699b1ccc75+0x52626)
    #2 core::intrinsics::copy_nonoverlapping::h1d4700678bc5a4c3 /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/intrinsics.rs:1523:4 (fs_dir-663770699b1ccc75+0x42ddc7)
    #3 core::ptr::swap_nonoverlapping_one::h25a5b1304aa688d1 /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/ptr/mod.rs:415:8 (fs_dir-663770699b1ccc75+0x42d39e)
    #4 core::mem::swap::hbfcdd49c70a3a4fd /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/mem/mod.rs:556:8 (fs_dir-663770699b1ccc75+0x42cd81)
    #5 core::ptr::replace::hb5f65ac8ac97857b /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/ptr/mod.rs:512:4 (fs_dir-663770699b1ccc75+0x38738e)
    #6 core::ptr::mut_ptr::_$LT$impl$u20$$BP$mut$u20$T$GT$::replace::h3685cb12b0499d13 /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libcore/ptr/mut_ptr.rs:813:8 (fs_dir-663770699b1ccc75+0x33e3f6)
    #7 tokio::task::harness::Harness$LT$T$C$S$GT$::store_join_waker::_$u7b$$u7b$closure$u7d$$u7d$::h87e3b6535fbef333 /home/tm/tokio/tokio/src/task/harness.rs:223:16 (fs_dir-663770699b1ccc75+0x31efa9)
    #8 tokio::loom::std::causal_cell::CausalCell$LT$T$GT$::with_mut::h811530a1661dc9d6 /home/tm/tokio/tokio/src/loom/std/causal_cell.rs:41:8 (fs_dir-663770699b1ccc75+0x317b48)
    #9 tokio::task::harness::Harness$LT$T$C$S$GT$::store_join_waker::hb11c71d9fc75b0a2 /home/tm/tokio/tokio/src/task/harness.rs:222:12 (fs_dir-663770699b1ccc75+0x31eb5c)
    #10 tokio::task::harness::Harness$LT$T$C$S$GT$::swap_join_waker::hf90737b941e0e5e7 /home/tm/tokio/tokio/src/task/harness.rs:254:23 (fs_dir-663770699b1ccc75+0x31faf2)
    #11 tokio::task::raw::swap_join_waker::h51e59c4d9f2dca15 /home/tm/tokio/tokio/src/task/raw.rs:186:4 (fs_dir-663770699b1ccc75+0x3ee391)
    #12 tokio::task::raw::RawTask::swap_join_waker::h5d8a5b55e194f988 /home/tm/tokio/tokio/src/task/raw.rs:135:17 (fs_dir-663770699b1ccc75+0x35a7d3)
    #13 _$LT$tokio..task..join..JoinHandle$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hfa937a1b34c2244d /home/tm/tokio/tokio/src/task/join.rs:115:16 (fs_dir-663770699b1ccc75+0x359b30)
    #14 tokio::fs::read_dir::ReadDir::poll_next_entry::hd82b6d4f7597b6c7 /home/tm/tokio/tokio/src/fs/read_dir.rs:71:44 (fs_dir-663770699b1ccc75+0x3dc14f)
    ...
    #55 std::sys::unix::thread::Thread::new::thread_start::h01caf79998f1ad89 /home/tm/test/rust/build/x86_64-unknown-linux-gnu/stage1/lib/rustlib/src/rust/src/libstd/sys/unix/thread.rs:80:16 (fs_dir-663770699b1ccc75+0x4539c9)

Reproducible when running tests, especially those from fs_dir.rs.

So far, my understanding of who has access to the waker field is as follows:

The join handle attempts to obtain a write lock on the waker field by removing
the JOIN_WAKER flag. If the removal happened before the task was marked as
COMPLETE or CANCELLED, the attempt succeeded; otherwise it failed.

The task obtains a read lock on the waker field by transitioning to the
COMPLETE or CANCELLED state. If the JOIN_WAKER flag was set in the state just
before the transition, then it holds the read lock; otherwise it does not.
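For illustration, a minimal sketch of that protocol with a hypothetical flag
layout (tokio's actual state word packs more information; the names and
orderings here are assumptions):

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical flag layout mirroring the description above.
const COMPLETE: usize = 0b001;
const CANCELLED: usize = 0b010;
const JOIN_WAKER: usize = 0b100;

// JoinHandle side: take the "write lock" by clearing JOIN_WAKER.
// Succeeds only if the task is not yet COMPLETE or CANCELLED.
fn unset_waker(state: &AtomicUsize) -> Result<usize, usize> {
    state.fetch_update(Ordering::AcqRel, Ordering::Acquire, |s| {
        if s & (COMPLETE | CANCELLED) == 0 {
            Some(s & !JOIN_WAKER)
        } else {
            None
        }
    })
}

// Task side: transition to COMPLETE. If JOIN_WAKER was still set in
// the state just before the transition, the task holds the read lock
// on the waker field.
fn transition_to_complete(state: &AtomicUsize) -> usize {
    state.fetch_or(COMPLETE, Ordering::AcqRel)
}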

If the above is correct, then in

let res1 = self.transition_to_complete(join_interest);

// At this point, the join waker may not be changed. Once we perform
// `release_task` we may no longer read from the struct but we
// **may** be responsible for dropping the waker. We do an
// optimistic read here.
let (join_waker, check) = unsafe { self.read_join_waker() };

let res2 = self.header().state.release_task();

if res1.has_join_waker() && !res2.is_join_interested() {
    debug_assert!(res2.has_join_waker());

    // It's our responsibility to drop the waker
    check.check();
    unsafe {
        drop(join_waker.assume_init());
    }
}

the call to self.read_join_waker() should be conditional on res1.has_join_waker(). With a change along those lines, I can no longer reproduce the issue.
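A minimal sketch of the suggested guard, reusing the names from the snippet
above (the actual fix landed as #2096, which also made the causality check
immediate rather than deferred):

let res1 = self.transition_to_complete(join_interest);

// Read the waker only when `res1` proves the JOIN_WAKER flag was set,
// i.e. this thread actually holds the read lock described above, so
// the read can no longer race with `store_join_waker`.
let join_waker = if res1.has_join_waker() {
    Some(unsafe { self.read_join_waker() })
} else {
    None
};

let res2 = self.header().state.release_task();

if res1.has_join_waker() && !res2.is_join_interested() {
    debug_assert!(res2.has_join_waker());

    // It's our responsibility to drop the waker.
    let (join_waker, check) = join_waker.unwrap();
    check.check();
    unsafe {
        drop(join_waker.assume_init());
    }
}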

@carllerche
Member

TSAN is prone to false positives, which is one reason why I moved away from it.

In this case, we are doing a read as "uninitialized memory" but not assuming it is initialized until we can validate that we are able to read that memory. You can see that the actual observation of the data only happens if res1.has_join_waker().

Unfortunately, TSAN is unable to validate this pattern.
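For illustration, a self-contained sketch of the pattern being described, with
hypothetical names (this is exactly the construction the rest of the thread
debates, so it is shown to illustrate the pattern, not to endorse it):

use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical one-shot slot using a "speculative racing load".
struct Slot<T> {
    ready: AtomicBool,
    value: UnsafeCell<MaybeUninit<T>>,
}

unsafe impl<T: Send> Sync for Slot<T> {}

impl<T> Slot<T> {
    // Copy the bytes unconditionally, then check the flag, and only
    // `assume_init` the copy when the flag proves the write
    // happened-before this read. The value is never *observed* unless
    // `ready` is set, but the unconditional copy is what TSAN reports
    // -- and, per the discussion below, it is a data race and thus UB
    // under the C/C++ (and current Rust) rules.
    fn speculative_read(&self) -> Option<T> {
        let copy = unsafe { self.value.get().read() };
        if self.ready.load(Ordering::Acquire) {
            Some(unsafe { copy.assume_init() })
        } else {
            None
        }
    }
}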

Now, the question is whether or not this pattern is a data race. I discussed this some with @RalfJung. What I do know is that established and studied algorithms, such as the Chase-Lev deque, use such a pattern.

We could guard read_join_waker() with res1.has_join_waker(), and that would reduce the number of TSAN errors, but it would not eliminate them entirely.

It seems that you have an interest in the concurrency details of Tokio. Why don't you ping me on Discord (https://discord.gg/tokio) and we can talk more. I would love to get more help w/ this & loom.

@RalfJung

AFAIK this is the pattern where I told you that yes, this is a data race and thus UB under the rules of C/C++, and thus also under the rules of Rust?

In this case, we are doing a read as "uninitialized memory" but not assuming it is initialized until we can validate that we are able to read that memory. You can see that the actual observation of the data only happens if res1.has_join_waker().

This is arguing from LLVM's rules (where read-write races make the read return undef). Rust makes no guarantee that it will always work the way LLVM does.

You are not alone with this, though; crossbeam's AtomicCell makes the same argument.

What I do know is that established and studied algorithms, such as the chase-lev deque, uses such a pattern.

Indeed, there are established and studied algorithms that cannot be implemented in C/C++; see for example this paper.

@carllerche
Member

@RalfJung How do you suggest handling this case then?

@carllerche
Member

@RalfJung I guess, are you suggesting that, in Rust, we avoid the "speculative racing load" pattern?

@tmiasko
Contributor Author

tmiasko commented Jan 10, 2020

My experience with thread sanitizer is quite different. Apart from custom
assembly used for synchronization, in properly instrumented code I have
generally encountered a single source of false positives: the lack of support
for memory fences.

In the context of tokio, I have seen only one additional data race. It is not
even in tokio, but rather in mio (in code that has since been removed on the
master branch). Though to make it clear: I had to instrument the four fences in
tokio, and I am using an instrumented standard library, which nowadays is
fortunately quite easy thanks to the cargo build-std functionality.
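For reference, a hedged example of the kind of invocation this enables on a
current nightly toolchain (the exact flags used here may have differed;
-Zbuild-std also requires the rust-src component):

RUSTFLAGS="-Zsanitizer=thread" \
    cargo +nightly build -Zbuild-std --target x86_64-unknown-linux-gnu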

There are memory models where a data race merely results in an undefined value,
which may trigger undefined behaviour only when used. But Rust uses the C
memory model, where a data race is simply undefined behaviour.

@RalfJung

RalfJung commented Jan 10, 2020

If you want to make your code match the Rust spec as it stands right now, the only option is to adjust loom to consider such things data races, and adjust the code to avoid them.

If you are fine with causing spec-UB that we know current compilers don't actually turn into miscompilations (and want to be added to rust-lang/unsafe-code-guidelines#158), you can continue as before. (Though as a precautionary measure, I recommend using a read_volatile for the racy read. This is what AtomicCell does. This helps mark that something weird is going on and reduces the risk of compiler interference. It won't make this any less spec-violating, though, under the current volatile semantics.)

I suggest closely monitoring the usual channels for any attempt to do MIR-level optimizations on atomics, as those might break your code (but I don't see that coming any time soon). Also, you will have to worry about alternative backends; e.g., I have no idea what cranelift has to say here.

Moreover, you could try to start lobbying for officially switching Rust to an LLVM-style memory model -- basically adjusting the spec to your needs. ;) You have some pretty good arguments for that (in the form of real code). However, I am on the fence about this because LLVM's model has been subject to much less formal scrutiny than C++'s, so I expect some bad surprises to lurk there -- tying Rust to the fate of that memory model might end up being a bad call.
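A sketch of the volatile variant suggested above, with a hypothetical helper
(per the comment, this reduces the risk of compiler interference but remains
spec-UB under the current volatile semantics):

use std::mem::MaybeUninit;
use std::ptr;

// Route the racy copy through a volatile read, as crossbeam's
// AtomicCell does. The copy is still only `assume_init`ed after a
// separate atomic check proves the write happened-before this read.
unsafe fn racy_read<T>(src: *const MaybeUninit<T>) -> MaybeUninit<T> {
    ptr::read_volatile(src)
}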

I guess, are you suggesting that, in Rust, we avoid the "speculative racing load" pattern?

Yes. Whenever possible, I suggest following the spec, while also arguing for improving the spec. Trying to "force" spec changes by acting as if they had already happened puts the lang team in a bad place (see for example the entire unwind-through-extern disaster). This is the equivalent of using a nightly feature on stable before the design is finished; there just isn't a way to have feature flags for UB.

But I also understand that sometimes there just isn't a way to comply with the spec, and then it's not like you have much choice -- but that makes it even more important to seek dialogue with the lang team, to make them aware that there is a problem and to start working towards a solution.

@carllerche
Member

carllerche commented Jan 10, 2020

@tmiasko yes, fences are not supported... and they are used (both in tokio and std), which results in false positives and a brittle whack-a-mole whitelisting process (when the backtrace changes, the whitelist needs to be updated).

@carllerche
Member

@tmiasko and also the usage of "speculative racy loads" (which, as you can see, is a grey area).

@carllerche
Member

There is another case of a speculative racy load in the queue. I'm not sure if there is a good way to avoid this one: https://github.com/tokio-rs/tokio/blob/master/tokio/src/runtime/thread_pool/queue/local.rs#L233-L287

@tmiasko
Contributor Author

tmiasko commented Jan 10, 2020

I have an implementation of a #[no_sanitize(thread)] attribute that disables
error reporting for the annotated function. It seems like it could be well
suited to cases where you intentionally cause a data race. It is not upstream
yet, since so far I have generally found other solutions.

I don't really see fences as a big issue, though they do prevent thread
sanitizer from "just working". Those in std and tokio can be easily replaced.
I hope it will be possible to land changes in std that replace them with
something that thread sanitizer understands, at least when it is enabled
(since it is possible to do that conditionally).
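For illustration, a hedged sketch of one way a fence could be replaced
conditionally: substituting an atomic operation that ThreadSanitizer models,
gated on the unstable sanitize cfg (names here are assumptions; std's Arc
later adopted a substitution along these lines):

// Requires nightly and #![feature(cfg_sanitize)].
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Normal build: a plain acquire fence.
#[cfg(not(sanitize = "thread"))]
fn acquire_fence(_sync: &AtomicUsize) {
    fence(Ordering::Acquire);
}

// TSAN build: an acquire load on the location the fence was meant to
// synchronize with, which ThreadSanitizer understands.
#[cfg(sanitize = "thread")]
fn acquire_fence(sync: &AtomicUsize) {
    sync.load(Ordering::Acquire);
}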

@RalfJung

Well, it's not really a "grey area" in the sense of "the spec is unclear". The spec is pretty clear. But it is a well-known deficiency of the spec.

carllerche pushed a commit that referenced this issue Jan 29, 2020
The previous implementation would perform a load that might be part of a
data race. The value read would be used only when the race did not occur.
This would be well defined in a memory model where a load that is part of
a race merely returns an undefined value; the Rust memory model, on the
other hand, defines it to be undefined behaviour.

Perform the read conditionally to avoid the data race.

Covered by existing loom tests after changing the causality check to be
immediate rather than deferred.

Fixes: #2087