Pass down resources to CRI #4113

marquiz · 2023-06-28T14:32:01Z

One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.

Issue link: Pass down resources to CRI #4112

Other comments:

Co-authored-by: Antti Kervinen <[email protected]>

marquiz · 2023-06-28T14:34:23Z

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

k8s-ci-robot · 2023-06-28T14:34:28Z

@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

haircommander · 2023-06-28T14:41:25Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+    map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
+    map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;


should the keys here be a special type instead of unstructured?

I don't think it's possible to have something like type ResourceName string in protobuf. Please correct me if I'm wrong.

ping @haircommander, are you satisfied with the reply (close as resolved)?

marquiz · 2023-07-18T07:59:40Z

/retitle Pass down resources to CRI

zvonkok · 2023-08-02T12:42:46Z

@marquiz We need to check how this will work with DRA and CDI devices, i.e. whether we have enough information to know which devices need to be added to the sandbox just from the resource claim name.

zvonkok · 2023-08-02T12:44:17Z

@marquiz There is already some code for sandbox sizing (accumulation of CPU and memory resources), for reference kubernetes/kubernetes#104886, which we leverage in Kata. What are the plans for this interface: deprecate it or keep it?

zvonkok · 2023-08-02T12:44:46Z

@bergwolf @egernst FYI

elezar

Thanks @marquiz.

It would be good to get more concrete details on the use cases that this would enable.
There is also the question of complex devices that are managed by device plugins where there isn't a clear mapping from the resources entry (e.g. vendor.com/xpu: 1) to the resources added to the container, or DRA where the associated resources.requests.claims entry is not mentioned.

elezar · 2023-08-02T13:02:04Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+
+#### Story 3
+
+As a cluster administrator, I want to install an OCI hook/runc wrapper/NRI


Could you expand on this use case? How does extending the CRI translate to modifications in the OCI runtime specification which is interpreted by runc (or wrappers)?

The CRI changes (in this KEP) would not directly translate to anything in the OCI config. It's just "informational" data that a possible hook/wrapper/plugin can then use to tweak the OCI config. Say you want to do customized CPU pinning in your plugin. I'll come up with some more flesh on this section...

@elezar I updated Story 3, PTAL

ping @elezar may I close this as resolved?

elezar · 2023-08-02T13:05:23Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+          requests:
+            cpu: 100m
+            memory: 100M
+            vendor.com/xpu: 1


For clarification: This does not indicate the properties of the resource that was actually allocated for a container requesting one of these devices?

That's very much true. I think I'll add a note about this in the KEP somewhere

@elezar I added a note about device plugin resources after this example. WDYT?

ping @elezar may I close this as resolved?

aojea · 2023-08-03T08:37:22Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+     WindowsPodSandboxConfig windows = 9;
+
+    // Kubernetes resource spec of the containers in the pod.
+    PodResourceConfig pod_resources = 10;


@MikeZappa87 since you shared recently something along these lines for the networking capabilities, this KEP also means to interface with NRI

zvonkok · 2023-08-03T09:45:53Z

Another point to consider is whether and how we're going to integrate these enhancements with the new containerd Sandbox API.

marquiz · 2023-08-03T13:24:20Z

There is already some code for sandbox sizing (accumulation of CPU and memory resources), for reference kubernetes/kubernetes#104886, which we leverage in Kata. What are the plans for this interface: deprecate it or keep it?

@zvonkok that one covers just the native resources and gives them in an "obfuscated" form, i.e. it doesn't tell the actual requests/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch this, i.e. we'd keep it.
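
A minimal sketch of why an accumulated, native-only view hides the actual requests. It is not taken from the KEP or from kubernetes/kubernetes#104886: the pod contents, the helper names and the small quantity parser are all assumptions made for illustration, and real sandbox sizing may account for more than this (for example pod overhead).

```python
from decimal import Decimal

# Multipliers for the small subset of quantity suffixes used in this sketch.
SUFFIXES = {
    "m": Decimal("0.001"),
    "M": Decimal(10) ** 6,
    "G": Decimal(10) ** 9,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
}

# Only these resources are summed; everything else is dropped.
NATIVE = ("cpu", "memory")

# Hypothetical pod: container name -> requests as written in the pod spec.
POD_REQUESTS = {
    "app": {"cpu": "100m", "memory": "100M", "vendor.com/xpu": "1"},
    "sidecar": {"cpu": "50m", "memory": "64Mi"},
}


def parse_quantity(text):
    """Parse a quantity string (subset only) into a Decimal base value."""
    # Check longer suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * SUFFIXES[suffix]
    return Decimal(text)


def aggregate_native(pod_requests):
    """Sum CPU and memory over all containers, ignoring other resources."""
    totals = {name: Decimal(0) for name in NATIVE}
    for requests in pod_requests.values():
        for name in NATIVE:
            if name in requests:
                totals[name] += parse_quantity(requests[name])
    return totals


totals = aggregate_native(POD_REQUESTS)
print(totals)
```

From the totals alone a runtime cannot tell which container asked for what, and the vendor.com/xpu request is not visible at all, which is the gap the per-container requests/limits maps are meant to close.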

zvonkok · 2023-08-03T14:50:07Z

@moshe010 @adrianchiris @shivamerla @cdesiniotis FYI

zvonkok · 2023-09-01T07:28:34Z

Since the DevicePlugin API supports CDI devices with this KEP: #4011, we should try to add more restrictions and requirements on how we want to design this passthrough interface. @marquiz FYI

zvonkok · 2023-09-01T07:31:51Z

Just for reference linking this old KEP here: #3080, looks like there were some comments that CDI may not be the complete solution to some use-cases. We do not want to break anything by relying only on CDI when passing down the devices.

Apokleos · 2024-12-06T06:49:04Z

Hi folks, any updates on this KEP? Any ideas to help move it forward?

marquiz · 2024-12-10T18:54:52Z

Updated:

target milestone updated to v1.33
references to other KEPs removed from non-goals
references to other KEPs (In-Place Update of Pod Resources #1287, Pod level resources #2837) elsewhere in the proposal updated (e.g. removed the conditionals "if kep-X is accepted as proposed...")

The related changes in the other referenced KEPs (#1287, #2837) are now merged, which hopefully makes this slightly easier to read. Those changes/KEPs (for Kubernetes v1.32) are still mentioned in the corresponding sections.

haircommander · 2025-01-31T15:58:42Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+Including the KubernetesResources in the ContainerConfig message serves
+multiple purposes:
+
+1. Catch changes that happen between pod sandbox creation and container


In-place resize should be writing a message to the runtime that a pod or container has changed resources, right @tallclair? I think that'd be better than having a runtime interpret diffs in container values...

Does it do so when the container hasn't been created yet? IIRC the thinking behind this case was the Kata/CoCo use cases. The runtime, NRI plugin, or whatever CAN detect changes if they need or want to.

Let's validate this flow. I don't think letting the runtime infer this from diffs is correct.
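
To make the point under discussion concrete, here is a minimal sketch of what "detecting changes" from diffs could look like on the runtime side. The function name and the example values are hypothetical; nothing here is part of the KEP or the CRI API, and it does not settle whether this is the right flow.

```python
def diff_resources(at_sandbox_creation, at_container_creation):
    """Return {resource_name: (old, new)} for every value that differs.

    Both arguments are plain dicts of resource name -> quantity string.
    Values are compared as strings, so "1" and "1000m" count as different
    even though they denote the same amount of CPU.
    """
    changed = {}
    for name in sorted(set(at_sandbox_creation) | set(at_container_creation)):
        old = at_sandbox_creation.get(name)
        new = at_container_creation.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Hypothetical limits seen when the sandbox was created ...
seen_at_sandbox = {"cpu": "100m", "memory": "100M"}
# ... and when the container is created later, after a resize.
seen_at_container = {"cpu": "200m", "memory": "100M"}

print(diff_resources(seen_at_sandbox, seen_at_container))
```

The sketch also shows the weakness raised above: the runtime has to keep its own copy of the earlier state and guess the intent behind a difference, whereas an explicit update message would state it.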

zvonkok · 2025-02-07T12:29:05Z

@SergeyKanzhelev Are all your questions answered? Are you happy with the current shape of the KEP? If yes, would you mind removing the hold? Thanks!

johnbelamaric

Previous PRR still looks good.

marquiz · 2025-02-11T15:20:32Z

Previous PRR still looks good.

@SergeyKanzhelev PTAL

tallclair · 2025-02-12T00:49:42Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+any information about the resources (such as native resources, host devices,
+mounts, CDI devices etc) that will be required by the application. The CRI
+runtime only becomes aware of the resources piece by piece when containers of
+the pod are created (one-by-one).


Which of this information must come from the Kubelet, and which can be read from the k8s API? Resource requests & limits and volumes can all be read directly from the pod object. Do they need to be plumbed through the CRI, or can you just read from the k8s API instead?

The idea is to have a clear separation between layers, with a normal top-to-bottom information flow between components. The kubelet knows the current state of the object, so it provides the whole state of that object down to the lower layers. This allows the lower layers to use the information immediately, in a consistent, transactional way, and avoids delaying the reply to a CRI gRPC call because another gRPC call to an outside service is needed first. We want to eliminate scenarios where different projects use backdoors or out-of-band communication to fetch the information they need. Examples are CNI providers that, based on the sandbox ID, connect to the k8s API server to get the whole pod object, or CNI providers that connect back to the kubelet in a "hey, kubelet, you sent me a CRI message but didn't provide enough information, but I know your PodResources API has that information for that pod, so let me fetch it from there" manner.

So, the whole idea is to have a clean separation between layers:

The kubelet is responsible for getting the higher-level object from the API server, preparing it to run, and giving it to the runtimes in a fully descriptive way.

Runtimes should not make calls from the bottom layers back to the upper layers (kubelet, API server) to get missing pieces.

Adding to that, for mounts we want more/different information than what is available on the Kubernetes API level. At least the host_path that will be mounted to the container.

In addition to the other points, in Kata/Confidential Containers we cannot handle subPath in volumeMounts because subPath is handled completely by the kubelet. https://github.com/kubernetes/kubernetes/blob/d7774fce9a7fcec890d7c0beffacd6ae34152b01/pkg/volume/util/subpath/subpath_linux.go#L175

c3d · 2025-04-02T09:35:30Z

@marquiz In the original description, what does "unmodified resource" mean? Specifically, does it mean "unmodified" relative to the originating specification, or resources that are not modified over the lifetime of a container / pod?

In the context of Kata Containers, there is some missing information that the runtime is not receiving and that is relevant because you need to be able to pass it down to a VM. That includes for example size limits for ephemeral storage. CC @gsid100.

gsid100 · 2025-04-02T10:17:12Z

@marquiz In the original description, what does "unmodified resource" mean? Specifically, does it mean "unmodified" relative to the originating specification, or resources that are not modified over the lifetime of a container / pod?

In the context of Kata Containers, there is some missing information that the runtime is not receiving and that is relevant because you need to be able to pass it down to a VM. That includes for example size limits for ephemeral storage. CC @gsid100.

Yes I can confirm, we noticed Kubelet doesn't pass the 'size' mount option of the ephemeral volume to CRI.
Is this being included in this PR? @marquiz

For example, applying this spec:

{
    ....
    "kind": "Pod",
    ....
    "spec": {
      "containers": [
        {
          "image": "quay.io/buildah/stable:v1.30",
          ....
          "volumeMounts": [
            {
              "mountPath": "/var/lib/containers",
              "name": "container-storage"
            }
          ]
        }
      ],
      "runtimeClassName": "kata",
      "volumes": [
        {
          "emptyDir": {
            "medium": "Memory",
            "sizeLimit": "1.3G"
          },
          "name": "container-storage"
        }
      ]
    }
  },
  ...
}

results in this config.json at /run/containers/storage/overlay-containers/<CONTAINER_ID>/userdata/config.json
(no size option)

    "mounts": [
        ....
        {
            "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "type": "bind",
            "source": "/var/lib/kubelet/pods/a7a08b7a-e064-47b9-9249-7ad57b0dc633/volumes/kubernetes.io~projected/kube-api-access-zr9w2",
            "options": [
                "rbind",
                "rprivate",
                "ro",
                "bind"
            ]
        }

kad · 2025-04-07T16:17:50Z

@marquiz In the original description, what does "unmodified resource" mean? Specifically, does it mean "unmodified" relative to the originating specification, or resources that are not modified over the lifetime of a container / pod?

That was more referring to the information in pod.spec.resources.{requests,limits} for each individual container, without any summarisation or implementation-specific calculations, such as requests.memory being converted to an oom_score.
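
As an illustration of such an implementation calculation, here is a simplified sketch of converting a CPU request in millicores to a cgroup CPU-shares value. It is modelled on the commonly described conversion (1024 shares per CPU, with a small minimum); the constants and the function name are assumptions of this sketch, not an excerpt from the kubelet.

```python
# Assumed constants for this sketch.
MIN_SHARES = 2          # smallest CPU-shares value handed out
SHARES_PER_CPU = 1024   # shares corresponding to one full CPU
MILLI_CPU_PER_CPU = 1000


def milli_cpu_to_shares(milli_cpu):
    """Convert a CPU request in millicores to a CPU-shares value."""
    if milli_cpu == 0:
        # No request at all still ends up with the minimum share.
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    return max(shares, MIN_SHARES)


# Different requests collapse to the same derived value.
for request in (0, 1, 2, 100, 250):
    print(request, milli_cpu_to_shares(request))
```

With these assumptions, requests of 0, 1m and 2m all map to 2 shares, so a runtime that only sees the derived value cannot recover what the pod spec actually asked for. That is the sense in which the proposal passes the resources down "unmodified".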

But that is a good catch about ephemeral storage. It is not covered properly in CRI, and I think its type is not present in CRI either, if my memory serves me well.

marquiz · 2025-04-08T09:02:45Z

Yes I can confirm, we noticed Kubelet doesn't pass the 'size' mount option of the ephemeral volume to CRI.

Thanks for the feedback. I'll update the proposal. Need to think a bit how to incorporate the relevant information in the proposal.

zvonkok · 2025-04-08T23:54:18Z

Another use case I am currently working on is proper NUMA support in Kata/CoCo. Memory hot-plugging is impossible with NUMA-enabled VMs, so I need to know the resource limits upfront for everything related to more sophisticated VM configurations. @marquiz @kad FYI

mythi · 2025-04-23T13:06:38Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+As a cluster administrator, I want to install an NRI plugin that does
+customized resource handling. I run kubelet with CPU manager and memory manager


A similar case, which could be a story of its own, stresses the goal of having better visibility into extended resources:

"As a cluster administrator, I want to install an NRI plugin that does limits enforcement for extended resources that kubelet does not know about (e.g., misc cgroup settings). These extended resources are consumed either via the pod spec or indirectly via Pod overhead".

Thx @mythi, I added this as a separate user story. WDYT?

marquiz · 2025-04-29T16:46:34Z

Yes I can confirm, we noticed Kubelet doesn't pass the 'size' mount option of the ephemeral volume to CRI.

Thanks for the feedback. I'll update the proposal. Need to think a bit how to incorporate the relevant information in the proposal.

@c3d @gsid100 I think I was a bit too optimistic/wishful on this. My feeling is that "volume management" is probably best kept out of the scope of this proposal for now. There is no concept of volumes in CRI. The kubelet prepares the volumes on its own and passes down the host path to be mounted inside the container (in mounts). [There is the exception of OCI image/artifact volumes, but they're always read-only.] And if we start to talk about parameters of emptyDir volumes, there are probably other ephemeral/non-ephemeral volume types that people would also like to include. Bringing volumes into the discussion would significantly broaden the scope, so I think it's best to have a separate KEP about that.

Thoughts?

dims · 2025-04-29T18:04:44Z

cc @tzneal

k8s-ci-robot · 2025-05-02T14:39:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander, marquiz, mikebrow
Once this PR has been reviewed and has the lgtm label, please assign dchen1107, wojtek-t for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

marquiz · 2025-05-02T14:41:28Z

Updated:

added new user story by @mythi
update kep.yaml for v1.34
slight rewording of risks/mitigations
remove SIDECAR_CONTAINER enum
align cri-api diff snippets with latest k/k master
update design details

mrunalp · 2025-05-05T21:37:49Z

keps/sig-node/4112-passdown-resources-to-cri/README.md

+resource allocation and thus better interoperability between containers inside
+the Pod. The runtime may use the information to make preparations for all
+upcoming containers of the pod. E.g.  setup all needed resources for a VM-based
+pod or prepare for optimal allocation of resources of all the containers of the


Can we provide more details around this aspect? It isn't clear to me from reading the KEP how the values will be used for optimal allocation. Maybe an example would help shed some light?

KEP: Initial version of the Pass down resources to CRI

7de612d

Co-authored-by: Antti Kervinen <[email protected]>

k8s-ci-robot added the cncf-cla: yes and kind/kep labels Jun 28, 2023

k8s-ci-robot requested review from dchen1107 and mrunalp June 28, 2023 14:32

k8s-ci-robot added the sig/node and size/XXL labels Jun 28, 2023

k8s-ci-robot requested review from haircommander, mikebrow, zvonkok and kad June 28, 2023 14:34

haircommander reviewed Jun 28, 2023

marquiz mentioned this pull request Jul 18, 2023

Pass down resources to CRI #4112

Open

4 tasks

k8s-ci-robot changed the title ~~KEP: Initial version of the Pass down resources to CRI~~ Pass down resources to CRI Jul 18, 2023

KEP-4112: address review feedback from haircommander

bc8e299

elezar reviewed Aug 2, 2023

aojea reviewed Aug 3, 2023

zvonkok mentioned this pull request Sep 18, 2023

Add support Edit Container Mount for kata/runtime-rs using direct volume in K8S/CSI scenario cncf-tags/container-device-interface#162

Closed

zvonkok mentioned this pull request Nov 4, 2024

k8s subPath is not supported by Kata kata-containers/kata-containers#10487

Open

Manciukic mentioned this pull request Nov 29, 2024

[PCIe] Community Guidelines and Roadmap firecracker-microvm/firecracker#4894

Draft

KEP-4112: update for v1.33

b881855

- update kep.yaml to target v1.33
- update references to kep-1287 and kep-2837

zvonkok mentioned this pull request Jan 9, 2025

Road to Confidential Containers with GPUs for v<TARGET_VERSION> confidential-containers/confidential-containers#278

Open

71 tasks

haircommander reviewed Jan 31, 2025

zvonkok mentioned this pull request Feb 6, 2025

kata-shim: update default maximum memory for gust vm kata-containers/kata-containers#10837

Open

johnbelamaric reviewed Feb 10, 2025

tallclair reviewed Feb 12, 2025

tallclair mentioned this pull request Mar 27, 2025

Set memory.min or memory.low, without memory.high kubernetes/kubernetes#131077

Open

zvonkok mentioned this pull request Apr 14, 2025

gpu: Fix CDI annotations kata-containers/kata-containers#11150

Merged

This was referenced Apr 22, 2025

runtime: support SizeLimit for tmpfs emptyDir volume kata-containers/kata-containers#10898

Draft

runtime: tmpfs emptyDir 'SizeLimit' field isn't being handled kata-containers/kata-containers#10897

Open

mythi reviewed Apr 23, 2025

KEP-4112: add new user story

5411cbb

Co-authored-by: Mikko Ylinen <[email protected]>

KEP-4112: update

94624f6

- update kep.yaml for v1.34
- slight rewording of risks/mitigations
- remove SIDECAR_CONTAINER enum
- align cri-api diff snippets with latest k/k master
- update design details

KEP-4112: update toc

71136e5

mrunalp reviewed May 5, 2025
