-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Support Firecracker clean teardown - prerequisite for rust sanitizers and/or Valgrind #2599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Really great job with this PR! Things are clear and well explained. I have tried running the test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still got a couple more commits to review, but looking good so far.
I am wondering whether there is significant impact on the shutdown time. Do you view this as critical/something to consider?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just don't forget to replace the LOG.error() gc-plp mentioned
@berciuliviu @alindima addressed all comments, PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job! LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
The design of the existing main() function was such that Firecracker never unwound the stack to the top level in order to end the program. Instead, it would run some Firecracker-specific shutdown code and then call `unsafe { libc::_exit() }`. That skips some Rust-specific shutdown done by std::process::exit(). But even using that does not call destructors for the stack objects, so it is not a "clean shutdown": https://doc.rust-lang.org/std/process/fn.exit.html While a "dirty shutdown" is easier, the benefit of having non-panic shutdowns run their finalization is that it helps with the accounting of Valgrind and sanitizers. It's a good sanity check to make sure the program could exit cleanly if it wanted to. This commit attempts a minimally-invasive means of bubbling up an "ExitCode". Signed-off-by: Adrian Catangiu <[email protected]> Co-authored-by: Cerul Alain <[email protected]>
We want to cleanly exit the API thread so the firecracker process can gracefully teardown. Because the API Server thread is in a loop that does a blocking wait on a socket, the only way to tell it to stop looping is via information on that socket. This uses an internal HTTP PUT signal to request that shutdown. Rather than forget the thread handle and the socket bind location, the main thread now sends the shutdown request and then waits while the thread joins. The request is for internal use only, however--because it is supposed to happen after the Vmm has been cleaned up and its contents destructed. A different approach for signaling the thread for this purpose could be used, e.g. to have the API server thread waiting on multiple events (the socket or this internal signal). But since this is only needed for clean shutdown, it is probably good enough. Signed-off-by: Adrian Catangiu <[email protected]> Co-authored-by: Cerul Alain <[email protected]>
The PrebootApiController in RPC-interface holds all preboot control logic. It only makes sense it should also handle fatal errors during pre-boot API calls. Remove workaround `libc::exit()` from api-server and cleanly bubble up any fatal errors from RPC controller to main thread cleanup. Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
This continues the work from prior commits toward the goal of releasing all resources with a clean shutdown. The VCPU threads in non-test configurations were running with an unbreakable wait, so they only way to get past them was with libc::exit(). This commit changes that, along with several other entangled issues. One was that Drop methods were used to try and clean up the Vmm and the VcpuHandles, but these were not being called. The reason is that there was a cyclic dependency: threads held onto the Vmm, so the Vmm could not exit VCPU threads in its Drop method (as it wouldn't be dropped so long as the VCPU threads were around). Another complexity surrounds defining the difference between an exited thread (from which an exit event can be read out of the channel) and a finished thread (which tears down the channel). Trying to send a message to a thread which has finished will panic, and some CPUs might be in the exit state while others may not...and reading the "exit state" by taking the message out of the channel removes it. It's a bit complicated. This commit tries to move toward simplification by being explicit about the Vmm.stop() call being responsible for exiting the VCPUs, and making the VcpuHandle's Drop method only do the simple task of joining an exited thread. The overloaded cyclical 'exit_evt' is renamed in this commit to 'vcpu_exit_event' and documented that is should be read-only to the Vmm. To avoid cycles, all teardown paths take the following route: +------------------+------------------------+----------------------+ | Vmm | Action | Vcpu | +------------------+------------------------+----------------------+ 1 | | | vcpu.exit(exit_code) | 2 | | |vcpu.exit_evt.write(1)| 3 | |<-- EventFd::exit_evt --| | 4 | vmm.stop() | | | 5 | |-- VcpuEvent::Finish -->| | 6 | | |StateMachine::finish()| 7 |VcpuHandle::join()| | | | | | | 8 |vmm.shutdown_exit_code = Some(code) breaking the main event loop | +------------------+------------------------+----------------------+ Vcpu initiated teardown starts from `fn Vcpu::exit()` (step 1). Vmm initiated teardown starts from `pub fn Vmm::stop()` (step 4). Once `vmm.shutdown_exit_code` becomes `Some(exit_code)`, it is the upper layer's responsibility to break main event loop and propagate the exit code value. Signed-off-by: Adrian Catangiu <[email protected]> Co-authored-by: Cerul Alain <[email protected]>
Firecracker's Rust integration tests were very hard to work with because Firecracker was doing 'libc::exit()' which would also exit the cargo-test process. As a workaround to that the tests' payload was running under a separate forked process. The problem with that was that it offered zero debuggability, the only thing reported being pass/fail. Using the new clean exit model, tests have been reworked to scrap the extra 'libc::fork()' and work in the common testing model where each test can run on a separate thread in the same process. Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Ideally we'd want a general way to include the relevant microvm logs in the test report for any failing tests. Since that's not possible/easy with current framework, at least explicitly include logs for any microvm that is killed by a signal. At the very least this provides visibility into any seccomp failures, and other traps the microvm process might hit during tests. Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Signed-off-by: Adrian Catangiu <[email protected]>
Glad it worked out! |
Reason for This PR
cargo
instead of being forked as separate processes. This drastically improves maintenance and debuggability (with oldfork()
approach we have no visibility into why a test fails, we just see that it does).valgrind
test to check for memory leaks around our FFI interfaces.Description of Changes
This PR is based on #2535 and some of the commits are co-authored by
Cerul Alain <[email protected]>
.This PR adds multiple incremental commits all working toward one goal: to allow Firecracker to propagate any exit reason all the way to the top level
main()
function and have a single program exit point while running all destructors and cleanup code during teardown of all resources.For this to be possible, the following high-level steps were taken (per-step details enclosed in their respective commit):
Vcpu
teardown to allow thread main function to end instead of blocking forever,Vmm
manages and exposes anshutdown_exit_code
that is used by the higher layers to stop the main loop in whenVmm
is shut down,libc::exit()
points across the codebase have been unified into a natural flow of propagating errors to above layers withmain()
being responsible for exiting the process,End result is:
Vcpu
andVmm
stopping protocolLicense Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.
PR Checklist
[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]
git commit -s
).Any required documentation changes (code and docs) are included in this PR.Any newly addedunsafe
code is properly documented.Any API changes are reflected infirecracker/swagger.yaml
.Any user-facing changes are mentioned inCHANGELOG.md
.