Support Firecracker clean teardown - prerequisite for rust sanitizers and/or Valgrind #2599

acatangiu · 2021-05-28T14:23:31Z

Reason for This PR

We want to simplify Rust integration tests to allow them to run under cargo instead of being forked as separate processes. This drastically improves maintenance and debuggability (with old fork() approach we have no visibility into why a test fails, we just see that it does).
We want to be able to run Rust sanitizers on the Firecracker unit and integration tests for further code-base/security hardening.
Check for Memory Leaks #119 we might want to add valgrind test to check for memory leaks around our FFI interfaces.

Description of Changes

This PR is based on #2535 and some of the commits are co-authored by Cerul Alain <[email protected]>.

This PR adds multiple incremental commits all working toward one goal: to allow Firecracker to propagate any exit reason all the way to the top level main() function and have a single program exit point while running all destructors and cleanup code during teardown of all resources.

For this to be possible, the following high-level steps were taken (per-step details enclosed in their respective commit):

slightly reworked Vcpu teardown to allow thread main function to end instead of blocking forever,
added internal api-endpoint used to trigger API thread teardown,
reworked the Vmm stopping/exit logic to remove existing cyclic dependencies that would not allow destructors to be run,
Vmm manages and exposes an shutdown_exit_code that is used by the higher layers to stop the main loop in when Vmm is shut down,
many libc::exit() points across the codebase have been unified into a natural flow of propagating errors to above layers with main() being responsible for exiting the process,
seccomp filters adjusted to allow syscalls done by Rust during process cleanup.

End result is:

a cleaner model that's easier to understand and work with,
cleaner Rust integration tests that are now easy to debug when failing,
removed weird workarounds like ApiServer running special logic to interpret Vmm responses to decide if it needs to kill process after responding
clean, cycle-free Vcpu and Vmm stopping protocol

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license.

PR Checklist

[Author TODO: Meet these criteria.]
[Reviewer TODO: Verify that these criteria are met. Request changes if not]

All commits in this PR are signed (git commit -s).
The reason for this PR is clearly provided (issue no. or explanation).
The description of changes is clear and encompassing.
~~Any required documentation changes (code and docs) are included in this PR.~~
~~Any newly added unsafe code is properly documented.~~
~~Any API changes are reflected in firecracker/swagger.yaml.~~
~~Any user-facing changes are mentioned in CHANGELOG.md.~~
All added/changed functionality is tested.

tests/framework/microvm.py

berciuliviu · 2021-05-31T15:09:34Z

Really great job with this PR! Things are clear and well explained. I have tried running the test test_failing_filter on my machine, but it passes everytime. Maybe a rebase with the latest changes on main and a test-rerun could do the trick?

alindima

I still got a couple more commits to review, but looking good so far.

I am wondering whether there is significant impact on the shutdown time. Do you view this as critical/something to consider?

src/vmm/src/lib.rs

berciuliviu

LGTM, just don't forget to replace the LOG.error() gc-plp mentioned

src/vmm/src/lib.rs

src/vmm/tests/integration_tests.rs

resources/seccomp/aarch64-unknown-linux-musl.json

resources/seccomp/x86_64-unknown-linux-musl.json

acatangiu · 2021-06-02T11:14:25Z

@berciuliviu @alindima addressed all comments, PTAL

alindima

Great job! LGTM

berciuliviu

LGTM!

The design of the existing main() function was such that Firecracker never unwound the stack to the top level in order to end the program. Instead, it would run some Firecracker-specific shutdown code and then call `unsafe { libc::_exit() }`. That skips some Rust-specific shutdown done by std::process::exit(). But even using that does not call destructors for the stack objects, so it is not a "clean shutdown": https://doc.rust-lang.org/std/process/fn.exit.html While a "dirty shutdown" is easier, the benefit of having non-panic shutdowns run their finalization is that it helps with the accounting of Valgrind and sanitizers. It's a good sanity check to make sure the program could exit cleanly if it wanted to. This commit attempts a minimally-invasive means of bubbling up an "ExitCode". Signed-off-by: Adrian Catangiu <[email protected]> Co-authored-by: Cerul Alain <[email protected]>

We want to cleanly exit the API thread so the firecracker process can gracefully teardown. Because the API Server thread is in a loop that does a blocking wait on a socket, the only way to tell it to stop looping is via information on that socket. This uses an internal HTTP PUT signal to request that shutdown. Rather than forget the thread handle and the socket bind location, the main thread now sends the shutdown request and then waits while the thread joins. The request is for internal use only, however--because it is supposed to happen after the Vmm has been cleaned up and its contents destructed. A different approach for signaling the thread for this purpose could be used, e.g. to have the API server thread waiting on multiple events (the socket or this internal signal). But since this is only needed for clean shutdown, it is probably good enough. Signed-off-by: Adrian Catangiu <[email protected]> Co-authored-by: Cerul Alain <[email protected]>

The PrebootApiController in RPC-interface holds all preboot control logic. It only makes sense it should also handle fatal errors during pre-boot API calls. Remove workaround `libc::exit()` from api-server and cleanly bubble up any fatal errors from RPC controller to main thread cleanup. Signed-off-by: Adrian Catangiu <[email protected]>

Signed-off-by: Adrian Catangiu <[email protected]>

This continues the work from prior commits toward the goal of releasing all resources with a clean shutdown. The VCPU threads in non-test configurations were running with an unbreakable wait, so they only way to get past them was with libc::exit(). This commit changes that, along with several other entangled issues. One was that Drop methods were used to try and clean up the Vmm and the VcpuHandles, but these were not being called. The reason is that there was a cyclic dependency: threads held onto the Vmm, so the Vmm could not exit VCPU threads in its Drop method (as it wouldn't be dropped so long as the VCPU threads were around). Another complexity surrounds defining the difference between an exited thread (from which an exit event can be read out of the channel) and a finished thread (which tears down the channel). Trying to send a message to a thread which has finished will panic, and some CPUs might be in the exit state while others may not...and reading the "exit state" by taking the message out of the channel removes it. It's a bit complicated. This commit tries to move toward simplification by being explicit about the Vmm.stop() call being responsible for exiting the VCPUs, and making the VcpuHandle's Drop method only do the simple task of joining an exited thread. The overloaded cyclical 'exit_evt' is renamed in this commit to 'vcpu_exit_event' and documented that is should be read-only to the Vmm. To avoid cycles, all teardown paths take the following route: +------------------+------------------------+----------------------+ | Vmm | Action | Vcpu | +------------------+------------------------+----------------------+ 1 | | | vcpu.exit(exit_code) | 2 | | |vcpu.exit_evt.write(1)| 3 | |<-- EventFd::exit_evt --| | 4 | vmm.stop() | | | 5 | |-- VcpuEvent::Finish -->| | 6 | | |StateMachine::finish()| 7 |VcpuHandle::join()| | | | | | | 8 |vmm.shutdown_exit_code = Some(code) breaking the main event loop | +------------------+------------------------+----------------------+ Vcpu initiated teardown starts from `fn Vcpu::exit()` (step 1). Vmm initiated teardown starts from `pub fn Vmm::stop()` (step 4). Once `vmm.shutdown_exit_code` becomes `Some(exit_code)`, it is the upper layer's responsibility to break main event loop and propagate the exit code value. Signed-off-by: Adrian Catangiu <[email protected]> Co-authored-by: Cerul Alain <[email protected]>

Firecracker's Rust integration tests were very hard to work with because Firecracker was doing 'libc::exit()' which would also exit the cargo-test process. As a workaround to that the tests' payload was running under a separate forked process. The problem with that was that it offered zero debuggability, the only thing reported being pass/fail. Using the new clean exit model, tests have been reworked to scrap the extra 'libc::fork()' and work in the common testing model where each test can run on a separate thread in the same process. Signed-off-by: Adrian Catangiu <[email protected]>

Signed-off-by: Adrian Catangiu <[email protected]>

Ideally we'd want a general way to include the relevant microvm logs in the test report for any failing tests. Since that's not possible/easy with current framework, at least explicitly include logs for any microvm that is killed by a signal. At the very least this provides visibility into any seccomp failures, and other traps the microvm process might hit during tests. Signed-off-by: Adrian Catangiu <[email protected]>

Signed-off-by: Adrian Catangiu <[email protected]>

AE1020 · 2021-06-03T18:32:31Z

Glad it worked out!

acatangiu self-assigned this May 28, 2021

acatangiu mentioned this pull request May 28, 2021

Clean Shutdown on Debug Builds (Valgrind-friendly) #2535

Closed

6 tasks

alindima self-requested a review May 28, 2021 14:39

gbionescu reviewed May 28, 2021

View reviewed changes

tests/framework/microvm.py Show resolved Hide resolved

gbionescu reviewed May 28, 2021

View reviewed changes

tests/framework/microvm.py Outdated Show resolved Hide resolved

gbionescu reviewed May 28, 2021

View reviewed changes

tests/framework/microvm.py Show resolved Hide resolved

dianpopa added Status: Awaiting review Indicates that a pull request is ready to be reviewed Status: Author labels May 31, 2021

alindima reviewed May 31, 2021

View reviewed changes

src/vmm/src/lib.rs Outdated Show resolved Hide resolved

acatangiu force-pushed the clean-shutdown branch from 513e6ed to 4c9f41d Compare June 1, 2021 09:24

acatangiu removed the Status: Author label Jun 1, 2021

berciuliviu previously approved these changes Jun 2, 2021

View reviewed changes

alindima reviewed Jun 2, 2021

View reviewed changes

src/vmm/src/lib.rs Outdated Show resolved Hide resolved

src/vmm/tests/integration_tests.rs Outdated Show resolved Hide resolved

resources/seccomp/aarch64-unknown-linux-musl.json Show resolved Hide resolved

resources/seccomp/aarch64-unknown-linux-musl.json Show resolved Hide resolved

alindima reviewed Jun 2, 2021

View reviewed changes

resources/seccomp/x86_64-unknown-linux-musl.json Outdated Show resolved Hide resolved

firecracker-microvm deleted a comment from gbionescu Jun 2, 2021

acatangiu dismissed berciuliviu’s stale review via bdeb034 June 2, 2021 11:12

acatangiu force-pushed the clean-shutdown branch from 4c9f41d to bdeb034 Compare June 2, 2021 11:12

alindima previously approved these changes Jun 2, 2021

View reviewed changes

berciuliviu previously approved these changes Jun 2, 2021

View reviewed changes

acatangiu and others added 6 commits June 2, 2021 15:32

vmm: use native type 'i32' for exit-codes

03fb428

Signed-off-by: Adrian Catangiu <[email protected]>

acatangiu dismissed stale reviews from berciuliviu and alindima via dcf5f25 June 2, 2021 12:33

acatangiu force-pushed the clean-shutdown branch from bdeb034 to dcf5f25 Compare June 2, 2021 12:33

acatangiu added 5 commits June 2, 2021 16:25

seccomp: allow syscalls required at teardown

15d2d26

Signed-off-by: Adrian Catangiu <[email protected]>

signal_handler: make sure we always flush metrics

da540c5

Signed-off-by: Adrian Catangiu <[email protected]>

tests: always log errors (even for passing tests)

2eb0edd

Signed-off-by: Adrian Catangiu <[email protected]>

tests: update coverage and binary size limits

0ca7f7c

Signed-off-by: Adrian Catangiu <[email protected]>

acatangiu force-pushed the clean-shutdown branch from dcf5f25 to 0ca7f7c Compare June 2, 2021 13:25

berciuliviu approved these changes Jun 2, 2021

View reviewed changes

alindima approved these changes Jun 2, 2021

View reviewed changes

acatangiu merged commit 9a9f933 into firecracker-microvm:main Jun 2, 2021

This was referenced Jun 2, 2021

Check for Memory Leaks #119

Closed

[Tests] Periodically run Firecracker integration tests under sanitizers #1662

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support Firecracker clean teardown - prerequisite for rust sanitizers and/or Valgrind #2599

Support Firecracker clean teardown - prerequisite for rust sanitizers and/or Valgrind #2599

Uh oh!

acatangiu commented May 28, 2021 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

berciuliviu commented May 31, 2021

Uh oh!

alindima left a comment

Uh oh!

Uh oh!

berciuliviu left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

acatangiu commented Jun 2, 2021

Uh oh!

alindima left a comment

Uh oh!

berciuliviu left a comment

Uh oh!

AE1020 commented Jun 3, 2021

Uh oh!

Uh oh!

Support Firecracker clean teardown - prerequisite for rust sanitizers and/or Valgrind #2599

Support Firecracker clean teardown - prerequisite for rust sanitizers and/or Valgrind #2599

Uh oh!

Conversation

acatangiu commented May 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reason for This PR

Description of Changes

License Acceptance

PR Checklist

Uh oh!

Uh oh!

Uh oh!

Uh oh!

berciuliviu commented May 31, 2021

Uh oh!

alindima left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

berciuliviu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

acatangiu commented Jun 2, 2021

Uh oh!

alindima left a comment

Choose a reason for hiding this comment

Uh oh!

berciuliviu left a comment

Choose a reason for hiding this comment

Uh oh!

AE1020 commented Jun 3, 2021

Uh oh!

Uh oh!

acatangiu commented May 28, 2021 •

edited

Loading