Skip to content

[WIP] Nested Virtualization: KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATE #322

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

phip1611
Copy link

@phip1611 phip1611 commented May 20, 2025

Summary of the PR

For live-migration and state save/resume in Cloud Hypervisor [PR] and other VMMs using nested virtualization, the relevant IOCTLs were missing so far to get the state from/into KVM.

Handling nested state is a little more complex than anticipated due to the "dynamic buffer" (structure with header reporting the length) that KVM uses in its interface. Creating a nice, safe, and easy interface took multiple iterations.. To mitigate this, my solution is a helper type acting as stack buffer that is good enough for most use-cases. To provide users more flexibility, the new functions consumes AsRef/AsMut<[u8]>, in case a user wants to use a type that is serde compatible over the new KvmNestedStateBuffer.

Hints for Reviewers

  • Please review this commit-by-commit.
  • This is the result of multiple internal iterations; I'm however open to design discussions!

Steps to Undraft

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

@roypat
Copy link
Member

roypat commented May 20, 2025

Heya!

Handling nested state is a little more complex than anticipated due to the "dynamic buffer" (structure with header reporting the length) that KVM uses in its interface.

If I'm reading the kernel code correctly, then since commit 6ca00dfafda7 ("KVM: x86: Modify struct kvm_nested_state to have explicit fields for data") this isn't actually true anymore: The u8 flexible array member that kvm_nested_state used to have got removed and replaced with a strongly typed union field with a statically known size. The only reason it was declared as a zero-sized member was backwards compatibility with Linux <5.2. But here we don't really need to worry about that, so we can just have the ioctl wrappers take &[mut] kvm_nested_state as generated by bindgen (potentially with a check that kvm_nested_state.size == sizeof::<kvm_nested_state>() for sanity and protected against memory unsafety in case the structure does ever get extended again).

cc @bonzini who probably knows more about this than me though

@phip1611
Copy link
Author

phip1611 commented May 20, 2025

Please note that kvm_nested_state from the current bindgen is too small and only contains the "header". Various checks in my code verify this. The actual size of the nested state also differs between VMX and SVM, I also verified this on my developer machine.

The effective size can be verified during runtime using kvm.check_extension_int(Cap::NestedState)

My introduced helper type solves this using stack allocated memory. But for SVM, we waste a page every time as the SVM state is smaller.

@roypat
Copy link
Member

roypat commented May 20, 2025

Please note that kvm_nested_state from the current bindgen is too small and only contains the "header".

argh, you're right, the bindgen version is indeed rubbish, hadnt actually looked at bindings.rs.

Various checks in my code verify this. The actual size of the nested state also differs between VMX and SVM, I also verified this on my developer machine.

The effective size can be verified during runtime using kvm.check_extension_int(Cap::NestedState)

My introduced helper type solves this using stack allocated memory. But for SVM, we waste a page every time as the SVM state is smaller.

yes this is all fine - I misinterpreted your PR description, sorry. I thought you were treating the kvm_nested_state struct as having a flexible array member (I got confused by the &[u8] conversions).

As for the serde thingy, I think the diff below is more in line with how serde support for other structures is handled in this crate, and allows the ioctl wrappers to be implemented in terms struct KvmNestedStateBuffer instead of [u8] slices:

diff --git a/kvm-bindings/src/x86_64/bindings.rs b/kvm-bindings/src/x86_64/bindings.rs
index d49ee9a4f4f3..248b4a8d76ac 100644
--- a/kvm-bindings/src/x86_64/bindings.rs
+++ b/kvm-bindings/src/x86_64/bindings.rs
@@ -1999,6 +1999,10 @@ impl Default for kvm_vmx_nested_state_data {
 }
 #[repr(C)]
 #[derive(Debug, Default, Copy, Clone, PartialEq)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::IntoBytes, zerocopy::Immutable, zerocopy::FromBytes)
+)]
 pub struct kvm_vmx_nested_state_hdr {
     pub vmxon_pa: __u64,
     pub vmcs12_pa: __u64,
@@ -2009,6 +2013,10 @@ pub struct kvm_vmx_nested_state_hdr {
 }
 #[repr(C)]
 #[derive(Debug, Default, Copy, Clone, PartialEq)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::IntoBytes, zerocopy::Immutable, zerocopy::FromBytes)
+)]
 pub struct kvm_vmx_nested_state_hdr__bindgen_ty_1 {
     pub flags: __u16,
 }
@@ -2065,6 +2073,10 @@ impl Default for kvm_svm_nested_state_data {
 }
 #[repr(C)]
 #[derive(Debug, Default, Copy, Clone, PartialEq)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::IntoBytes, zerocopy::Immutable, zerocopy::FromBytes)
+)]
 pub struct kvm_svm_nested_state_hdr {
     pub vmcb_pa: __u64,
 }
@@ -2087,6 +2099,10 @@ pub struct kvm_nested_state {
 }
 #[repr(C)]
 #[derive(Copy, Clone)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::Immutable, zerocopy::FromBytes)
+)]
 pub union kvm_nested_state__bindgen_ty_1 {
     pub vmx: kvm_vmx_nested_state_hdr,
     pub svm: kvm_svm_nested_state_hdr,
diff --git a/kvm-bindings/src/x86_64/mod.rs b/kvm-bindings/src/x86_64/mod.rs
index 08e1bab89f26..9d9b69150fdc 100644
--- a/kvm-bindings/src/x86_64/mod.rs
+++ b/kvm-bindings/src/x86_64/mod.rs
@@ -9,6 +9,8 @@ pub mod fam_wrappers;
 #[cfg(feature = "serde")]
 mod serialize;
 
+pub mod nested;
+
 pub use self::bindings::*;
 #[cfg(feature = "fam-wrappers")]
 pub use self::fam_wrappers::*;
diff --git a/kvm-bindings/src/x86_64/nested.rs b/kvm-bindings/src/x86_64/nested.rs
new file mode 100644
index 000000000000..1298984f7538
--- /dev/null
+++ b/kvm-bindings/src/x86_64/nested.rs
@@ -0,0 +1,59 @@
+use ::{kvm_nested_state__bindgen_ty_1, KVM_STATE_NESTED_VMX_VMCS_SIZE};
+use KVM_STATE_NESTED_SVM_VMCB_SIZE;
+
+
+#[derive(Clone, Copy)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::Immutable, zerocopy::FromBytes)
+)]
+#[repr(C)]
+pub union kvm_nested_state__data {
+    pub vmx: kvm_vmx_nested_state_data,
+    pub svm: kvm_svm_nested_state_data,
+}
+
+impl Default for kvm_nested_state__data {
+    fn default() -> Self {
+        let mut s = ::std::mem::MaybeUninit::<Self>::uninit();
+        unsafe {
+            ::std::ptr::write_bytes(s.as_mut_ptr(), 0, 1);
+            s.assume_init()
+        }
+    }
+}
+
+#[derive(Clone, Copy)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::IntoBytes, zerocopy::Immutable, zerocopy::FromBytes)
+)]
+#[repr(C)]
+pub struct kvm_vmx_nested_state_data {
+    pub vmcs12: [u8; KVM_STATE_NESTED_VMX_VMCS_SIZE as usize],
+    pub shadow_vmcs12: [u8; KVM_STATE_NESTED_VMX_VMCS_SIZE as usize],
+}
+
+#[derive(Clone, Copy)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::IntoBytes, zerocopy::Immutable, zerocopy::FromBytes)
+)]
+#[repr(C)]
+pub struct kvm_svm_nested_state_data {
+    pub vmcb12: [u8; KVM_STATE_NESTED_SVM_VMCB_SIZE as usize],
+}
+
+#[derive(Clone, Copy, Default)]
+#[cfg_attr(
+    feature = "serde",
+    derive(zerocopy::IntoBytes, zerocopy::Immutable, zerocopy::FromBytes)
+)]
+#[repr(C)]
+pub struct kvm_nested_state {
+    pub flags: u16,
+    pub format: u16,
+    pub size: u32,
+    pub hdr: kvm_nested_state__bindgen_ty_1,
+    pub data: kvm_nested_state__data,
+}
diff --git a/kvm-bindings/src/x86_64/serialize.rs b/kvm-bindings/src/x86_64/serialize.rs
index 11d90d533936..e0e109e15cd7 100644
--- a/kvm-bindings/src/x86_64/serialize.rs
+++ b/kvm-bindings/src/x86_64/serialize.rs
@@ -10,6 +10,8 @@ use bindings::{
 use fam_wrappers::kvm_xsave2;
 use serde::{Deserialize, Deserializer, Serialize, Serializer};
 use zerocopy::{transmute, FromBytes, FromZeros, Immutable, IntoBytes};
+use kvm_nested_state__bindgen_ty_1;
+use nested::{kvm_nested_state, kvm_nested_state__data};
 
 serde_impls!(
     kvm_regs,
@@ -31,7 +33,8 @@ serde_impls!(
     kvm_cpuid2,
     kvm_xsave,
     kvm_xsave2,
-    kvm_irqchip
+    kvm_irqchip,
+    kvm_nested_state
 );
 
 // SAFETY: zerocopy's derives explicitly disallow deriving for unions where
@@ -94,10 +97,35 @@ unsafe impl Immutable for kvm_irqchip__bindgen_ty_1 {
     }
 }
 
+// SAFETY: zerocopy's derives explicitly disallow deriving for unions where
+// the fields have different sizes, due to the smaller fields having padding.
+// Miri however does not complain about these implementations (e.g. about
+// reading the "padding" for one union field as valid data for a bigger one)
+unsafe impl IntoBytes for kvm_nested_state__bindgen_ty_1 {
+    fn only_derive_is_allowed_to_implement_this_trait()
+    where
+        Self: Sized
+    {
+    }
+}
+
+// SAFETY: zerocopy's derives explicitly disallow deriving for unions where
+// the fields have different sizes, due to the smaller fields having padding.
+// Miri however does not complain about these implementations (e.g. about
+// reading the "padding" for one union field as valid data for a bigger one)
+unsafe impl IntoBytes for kvm_nested_state__data {
+    fn only_derive_is_allowed_to_implement_this_trait()
+    where
+        Self: Sized
+    {
+    }
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
     use bindings::*;
+    use nested;
 
     fn is_serde<T: Serialize + for<'de> Deserialize<'de> + Default>() {
         let config = bincode::config::standard();
@@ -152,6 +180,7 @@ mod tests {
         is_serde::<kvm_xcrs>();
         is_serde::<kvm_irqchip>();
         is_serde::<kvm_mp_state>();
+        is_serde::<nested::kvm_nested_state>();
     }
 
     fn is_serde_json<T: Serialize + for<'de> Deserialize<'de> + Default>() {
@@ -184,5 +213,6 @@ mod tests {
         is_serde_json::<kvm_xcrs>();
         is_serde_json::<kvm_irqchip>();
         is_serde_json::<kvm_mp_state>();
+        is_serde_json::<nested::kvm_nested_state>();
     }
 }

@phip1611 phip1611 force-pushed the nested-state branch 3 times, most recently from 7b3a492 to ee6c962 Compare June 3, 2025 13:30
@phip1611
Copy link
Author

phip1611 commented Jun 3, 2025

Thanks for your guidance, I gave it another shot. I think the outcome is quite nice. Thoughts?

phip1611 added 3 commits June 3, 2025 15:33
This type is a helper, making the use of get_nested_state() and
set_nested_state(), which are added in the next commit, much more
convenient.

Note that this type's name uses UpperCamelCase as it is not just
a plain old data type but actually contains some logic: the
`size` field is properly initialized.

Effectively, KVM expects a dynamic buffer with a header reporting
the size to either store the nested state in or load it from. As such
data structures with a certain alignment are challenging to work with
(note that Vec<u8> always have an alignment of 1 but we need 4),
this type sacrifices a little memory overhead in some cases for
better UX.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
These calls are relevant for live-migration and state save/resume
when nested virtualization is used.

I tested everything with a nested guest in Cloud Hypervisor, but these
patches are not yet upstream.

Signed-off-by: Philipp Schuster <[email protected]>
On-behalf-of: SAP [email protected]
Copy link
Member

@roypat roypat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this looks good to me! Could you just squash the commits a little bit so that we dont first introduce struct kvm_nested_state and then modify it in two follow up commits to rename / add the non-exhaustiveness?

Comment on lines +78 to +79
// Prevent constructor bypass in public API.
_priv: core::marker::PhantomData<()>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can use #[non_exhaustive] on the struct for this?

@@ -18,6 +18,8 @@ extern crate serde;
#[cfg(feature = "serde")]
extern crate zerocopy;

extern crate core;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? :o

Copy link
Author

@phip1611 phip1611 Jun 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because kvm-bindings still uses Rust edition 2015; needed to access core::mem.

Strictly speaking, the crate doesn't specify any Rust edition and falls backto 2015

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whew. I'll submit a PR to get us onto edition 2024 tmrw, no reason for us to be on 2015 except that no one ever bothered to update it I guess (we dont really have a MSRV policy around here either)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants