Pass down resources to CRI #4113

Open
wants to merge 30 commits into
base: master

Conversation

Contributor

@marquiz marquiz commented Jun 28, 2023

  • One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jun 28, 2023
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 28, 2023
@marquiz
Contributor Author

marquiz commented Jun 28, 2023

@k8s-ci-robot
Contributor

@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @haircommander @mikebrow @zvonkok @fidencio @kad

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines 289 to 290
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;
Contributor

should the keys here be a special type instead of unstructured?

Contributor Author

I don't think it's possible to have something like type ResourceName string in protobuf. Please correct me if I'm wrong.
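For context, protobuf map keys are limited to integral and string scalar types, so a named key type cannot be declared in the .proto file itself. A consumer can still add the type on its own side. The following is a minimal Go sketch of that idea; `ResourceName` and `typedKeys` are hypothetical names that are not part of the proposed CRI messages, and quantities are kept as plain strings to avoid depending on the Quantity type.

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceName is a hypothetical Go-side named type. It is not part of the
// proposed CRI messages, where map keys remain plain strings.
type ResourceName string

// typedKeys copies a string-keyed map, as generated protobuf code would
// expose it, into a map keyed by ResourceName.
func typedKeys(raw map[string]string) map[ResourceName]string {
	out := make(map[ResourceName]string, len(raw))
	for name, quantity := range raw {
		out[ResourceName(name)] = quantity
	}
	return out
}

func main() {
	requests := typedKeys(map[string]string{"cpu": "100m", "vendor.com/xpu": "1"})

	// Sort the names so the output order is stable.
	names := make([]string, 0, len(requests))
	for name := range requests {
		names = append(names, string(name))
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Println(name, requests[ResourceName(name)])
	}
}
```

The wire format is unchanged by this; the typing exists only in the consumer's code.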

Contributor Author

ping @haircommander, are you satisfied with the reply (close as resolved)?

@marquiz marquiz mentioned this pull request Jul 18, 2023
4 tasks
@marquiz
Contributor Author

marquiz commented Jul 18, 2023

/retitle Pass down resources to CRI

@k8s-ci-robot k8s-ci-robot changed the title KEP: Initial version of the Pass down resources to CRI Pass down resources to CRI Jul 18, 2023
@zvonkok

zvonkok commented Aug 2, 2023

@marquiz We need to check how this will work with DRA and CDI devices, specifically whether the resource claim name alone gives us enough information to know which devices need to be added to the sandbox.

@zvonkok

zvonkok commented Aug 2, 2023

@marquiz There is already some code for sandbox sizing, accumulation of resources CPU and Memory for reference: kubernetes/kubernetes#104886 that we leverage in Kata, what are the plans for this interface, deprecate or keep it?

@zvonkok

zvonkok commented Aug 2, 2023

@bergwolf @egernst FYI

Contributor

@elezar elezar left a comment

Thanks @marquiz.

It would be good to get more concrete details on the use cases that this would enable.
There is also the question of complex devices that are managed by device plugins where there isn't a clear mapping from the resources entry (e.g. vendor.com/xpu: 1) to the resources added to the container, or DRA where the associated resources.requests.claims entry is not mentioned.


#### Story 3

As a cluster administrator, I want to install an OCI hook/runc wrapper/NRI
Contributor

Could you expand on this use case? How does extending the CRI translate to modifications in the OCI runtime specification which is interpreted by runc (or wrappers)?

Contributor Author

The CRI changes (in this KEP) would not directly translate to anything in the OCI config. It's just "informational" that a possible hook/wrapper/plugin can then use to tweak the OCI config. Say you want to do customized cpu pinning in your plugin. I'll come up with some more flesh on this section...
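To make the CPU pinning example a little more concrete, here is a minimal Go sketch of the kind of decision such a plugin could make from the passed-down requests and limits. Everything in it is an assumption for illustration: `containerResources` and `wantsExclusiveCPUs` are hypothetical names, and the rule (pin only when the CPU request equals the limit and is a whole number of cores) is one possible policy, not something this KEP defines.

```go
package main

import "fmt"

// containerResources is a hypothetical, simplified stand-in for the
// per-container resource information discussed here. CPU values are in
// millicores.
type containerResources struct {
	Name          string
	CPURequestMil int64
	CPULimitMil   int64
}

// wantsExclusiveCPUs reports how many whole CPUs a plugin might pin for the
// container. It returns false unless the request equals the limit and is a
// non-zero whole number of cores.
func wantsExclusiveCPUs(c containerResources) (int64, bool) {
	if c.CPURequestMil == 0 || c.CPURequestMil != c.CPULimitMil {
		return 0, false
	}
	if c.CPURequestMil%1000 != 0 {
		return 0, false
	}
	return c.CPURequestMil / 1000, true
}

func main() {
	containers := []containerResources{
		{Name: "latency-critical", CPURequestMil: 2000, CPULimitMil: 2000},
		{Name: "burstable", CPURequestMil: 100, CPULimitMil: 500},
	}
	for _, c := range containers {
		cpus, ok := wantsExclusiveCPUs(c)
		fmt.Printf("%s: pin=%v cpus=%d\n", c.Name, ok, cpus)
	}
}
```

A decision like this needs both the original request and the original limit, which is the "informational" data the CRI changes would carry.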

Contributor Author

@elezar I updated Story 3, PTAL

Contributor

Thanks.

Contributor Author

Resolved?

Contributor Author

ping @elezar may I close this as resolved?

requests:
  cpu: 100m
  memory: 100M
  vendor.com/xpu: 1
Contributor

For clarification: This does not indicate the properties of the resource that was actually allocated for a container requesting one of these devices?

Contributor Author

That's very much true. I think I'll add a note about this in the KEP somewhere
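As an illustration of what the requests map does and does not tell a consumer, the Go sketch below picks out the entries that look like extended resources. The function name `extendedResourceNames` is hypothetical, and the "name contains a slash" test is a simplifying assumption for this sketch rather than the exact Kubernetes rule.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// extendedResourceNames returns the request keys that look like extended
// resources. The "contains a slash" test is a simplifying assumption.
func extendedResourceNames(requests map[string]string) []string {
	var names []string
	for name := range requests {
		if strings.Contains(name, "/") {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	requests := map[string]string{
		"cpu":            "100m",
		"memory":         "100M",
		"vendor.com/xpu": "1",
	}
	for _, name := range extendedResourceNames(requests) {
		// Only the requested amount is visible here. Which device was
		// actually allocated is not part of this information.
		fmt.Printf("%s requested: %s\n", name, requests[name])
	}
}
```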

Contributor Author

@elezar I added a note about device plugin resources after this example. WDYT?

Contributor Author

ping @elezar may I close this as resolved?

WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
Member

@MikeZappa87 since you shared recently something along these lines for the networking capabilities, this KEP also means to interface with NRI

@zvonkok

zvonkok commented Aug 3, 2023

Another point to consider is how we're going to integrate or not these enhancements with the new containerd Sandbox API.

@marquiz
Contributor Author

marquiz commented Aug 3, 2023

There is already some code for sandbox sizing, accumulation of resources CPU and Memory for reference: kubernetes/kubernetes#104886 that we leverage in Kata, what are the plans for this interface, deprecate or keep it?

@zvonkok that one covers only the native resources and gives them in an "obfuscated" form, i.e. it doesn't tell the actual requests/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch this, i.e. we'd keep it.
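A small sketch may help show what "obfuscated" means for CPU. The Go code below converts a CPU request in millicores to cgroup CPU shares and back. The constants (1024 shares per CPU and a lower bound of 2) and the function names are stated here as assumptions about how the conversion is commonly done, not as a copy of the kubelet code.

```go
package main

import "fmt"

const (
	minShares       = 2    // assumed lower bound for CPU shares
	sharesPerCPU    = 1024 // assumed shares for one full CPU
	milliCPUPerCPU  = 1000
)

// milliCPUToShares converts a CPU request in millicores to CPU shares.
func milliCPUToShares(milliCPU int64) int64 {
	shares := milliCPU * sharesPerCPU / milliCPUPerCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

// sharesToMilliCPU is the naive inverse. Integer rounding means it does not
// always give back the original request.
func sharesToMilliCPU(shares int64) int64 {
	return shares * milliCPUPerCPU / sharesPerCPU
}

func main() {
	for _, request := range []int64{100, 250, 1000} {
		shares := milliCPUToShares(request)
		fmt.Printf("request=%dm shares=%d recovered=%dm\n",
			request, shares, sharesToMilliCPU(shares))
	}
}
```

With these assumptions a 100m request becomes 102 shares, and the naive inverse yields 99m, so the original value cannot be reliably reconstructed from the converted one.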

@zvonkok

zvonkok commented Aug 3, 2023

@zvonkok

zvonkok commented Sep 1, 2023

Since the DevicePlugin API supports CDI devices with this KEP: #4011 we should try to add more restrictions and requirements how we want to design this passthrough interface. @marquiz FYI

@zvonkok

zvonkok commented Sep 1, 2023

Just for reference linking this old KEP here: #3080, looks like there were some comments that CDI may not be the complete solution to some use-cases. We do not want to break anything by relying only on CDI when passing down the devices.

@Apokleos

Apokleos commented Dec 6, 2024

Hi folks, any updates on this KEP? Any ideas to help move it forward?

- update kep.yaml to target v1.33
- update references to kep-1287 and kep-2837
@marquiz
Contributor Author

marquiz commented Dec 10, 2024

Updated:

The related changes in other referenced KEPs (#1287, #2837) are now merged, which hopefully makes this slightly easier to read. Those changes/KEPs (for Kubernetes v1.32) are still mentioned in the corresponding sections.

Including the KubernetesResources in the ContainerConfig message serves
multiple purposes:

1. Catch changes that happen between pod sandbox creation and container
Contributor

in place resize should be writing a message to the runtime that a pod or container had changed resources, right @tallclair ? I think that'd be better than having a runtime interpret diffs in container values...

Contributor Author

Does it do so when the container hasn't been created yet? IIRC the thinking behind this case was the Kata/CoCo use cases. The runtime or an NRI plugin CAN detect changes if it needs or wants to.

Contributor

Let's validate this flow. I don't think letting the runtime infer this from diffs is correct.
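To make the point under discussion concrete, this is roughly what "inferring from diffs" would mean on the runtime side: comparing the resource values seen at pod sandbox creation with those seen at container creation. It is only a Go sketch with hypothetical names (`changedResources`), and it compares quantities as plain strings, so "1" and "1000m" would be reported as different even though they are equal amounts.

```go
package main

import (
	"fmt"
	"sort"
)

// changedResources returns the resource names whose value differs between
// the two snapshots, including names present in only one of them. Values
// are compared as raw strings, which is a simplification.
func changedResources(atSandbox, atCreate map[string]string) []string {
	seen := make(map[string]bool, len(atSandbox))
	var changed []string
	for name, before := range atSandbox {
		seen[name] = true
		if after, ok := atCreate[name]; !ok || after != before {
			changed = append(changed, name)
		}
	}
	for name := range atCreate {
		if !seen[name] {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	atSandbox := map[string]string{"cpu": "1", "memory": "1Gi"}
	atCreate := map[string]string{"cpu": "2", "memory": "1Gi"}
	fmt.Println(changedResources(atSandbox, atCreate))
}
```

An explicit update message from the kubelet, as suggested above, would avoid this kind of comparison logic in every runtime.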

@zvonkok

zvonkok commented Feb 7, 2025

@SergeyKanzhelev Are all your questions answered, and are you happy with the current shape of the KEP? If yes, would you mind removing the hold? Thanks!

Member

@johnbelamaric johnbelamaric left a comment

Previous PRR still looks good.

@marquiz
Contributor Author

marquiz commented Feb 11, 2025

Previous PRR still looks good.

@SergeyKanzhelev PTAL

any information about the resources (such as native resources, host devices,
mounts, CDI devices etc) that will be required by the application. The CRI
runtime only becomes aware of the resources piece by piece when containers of
the pod are created (one-by-one).
Member

Which of this information must come from the Kubelet, and which can be read from the k8s API? Resource requests & limits and volumes can all be read directly from the pod object. Do they need to be plumbed through the CRI, or can you just read from the k8s API instead?

Member

The idea is to have a clear separation between layers, with a normal top-to-bottom information flow between components. The kubelet knows the current state of the object, so it provides the whole state of that object down to the lower layers. This lets the lower layers use the information immediately, in a consistent, transactional way, and removes the need to delay the reply to a CRI gRPC call while making another gRPC call to an outside service. We want to eliminate scenarios where different projects use backdoors or out-of-band communication to fetch the information they need. Examples are CNI providers that, based on the sandbox ID, connect to the Kubernetes API server to get the whole pod object, or CNI providers that connect back to the kubelet in a "hey, kubelet, you sent me a CRI message without enough information, but I know your PodResources API has it for that pod, so let me fetch it from there" manner.

So, the whole idea is to have a clean separation between layers:

  • The kubelet is responsible for getting the higher-level object from the API server, preparing it to run, and giving it to the runtimes in a fully descriptive way.
  • Runtimes should not call back from the bottom layers to the upper layers (kubelet, API server) to get missing pieces.

Contributor Author

Adding to that, for mounts we want more/different information than what is available on the Kubernetes API level. At least the host_path that will be mounted to the container.

@zvonkok zvonkok Feb 13, 2025

In addition to the other points, in Kata/Confidential Containers we cannot handle subPath in volumeMounts because subPath is handled completely by the kubelet. https://github.com/kubernetes/kubernetes/blob/d7774fce9a7fcec890d7c0beffacd6ae34152b01/pkg/volume/util/subpath/subpath_linux.go#L175

@c3d

c3d commented Apr 2, 2025

@marquiz In the original description, what does "unmodified resource" mean? Specifically, does it mean "unmodified" relative to the originating specification, or resources that are not modified over the lifetime of a container / pod?

In the context of Kata Containers, there is some missing information that the runtime is not receiving and that is relevant because you need to be able to pass it down to a VM. That includes for example size limits for ephemeral storage. CC @gsid100.

@gsid100

gsid100 commented Apr 2, 2025

@marquiz In the original description, what does "unmodified resource" mean? Specifically, does it mean "unmodified" relative to the originating specification, or resources that are not modified over the lifetime of a container / pod?

In the context of Kata Containers, there is some missing information that the runtime is not receiving and that is relevant because you need to be able to pass it down to a VM. That includes for example size limits for ephemeral storage. CC @gsid100.

Yes I can confirm, we noticed Kubelet doesn't pass the 'size' mount option of the ephemeral volume to CRI.
Is this being included in this PR? @marquiz

For example, applying this spec:

{
    ....
    "kind": "Pod",
    ....
    "spec": {
      "containers": [
        {
          "image": "quay.io/buildah/stable:v1.30",
          ....
          "volumeMounts": [
            {
              "mountPath": "/var/lib/containers",
              "name": "container-storage"
            }
          ]
        }
      ],
      "runtimeClassName": "kata",
      "volumes": [
        {
          "emptyDir": {
            "medium": "Memory",
            "sizeLimit": "1.3G"
          },
          "name": "container-storage"
        }
      ]
    }
  },
  ...
}

results in this config.json at /run/containers/storage/overlay-containers/<CONTAINER_ID>/userdata/config.json
(no size option)

    "mounts": [
           ....
           {
                    "destination": "/var/run/secrets/kubernetes.io/serviceaccount",
                    "type": "bind",
                    "source": "/var/lib/kubelet/pods/a7a08b7a-e064-47b9-9249-7ad57b0dc633/volumes/kubernetes.io~projected/kube-api-access-zr9w2",
                    "options": [
                            "rbind",
                            "rprivate",
                            "ro",
                            "bind"
                    ]
            }
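A check like the one described above can be scripted. The Go sketch below parses the mounts array of an OCI config.json and lists the mount destinations that carry no size= option. The struct and function names (`ociConfig`, `mount`, `mountsWithoutSize`) are hypothetical, only the fields needed here are modeled, and the sample input is illustrative rather than taken from a real node.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// mount models only the fields of an OCI mount entry used in this sketch.
type mount struct {
	Destination string   `json:"destination"`
	Type        string   `json:"type"`
	Options     []string `json:"options"`
}

// ociConfig models only the mounts array of config.json.
type ociConfig struct {
	Mounts []mount `json:"mounts"`
}

// mountsWithoutSize returns the destinations of all mounts that have no
// "size=" option.
func mountsWithoutSize(configJSON []byte) ([]string, error) {
	var cfg ociConfig
	if err := json.Unmarshal(configJSON, &cfg); err != nil {
		return nil, err
	}
	var missing []string
	for _, m := range cfg.Mounts {
		hasSize := false
		for _, opt := range m.Options {
			if strings.HasPrefix(opt, "size=") {
				hasSize = true
				break
			}
		}
		if !hasSize {
			missing = append(missing, m.Destination)
		}
	}
	return missing, nil
}

func main() {
	// Illustrative input only.
	sample := []byte(`{"mounts":[
		{"destination":"/var/lib/containers","type":"bind","options":["rbind","rprivate","rw"]},
		{"destination":"/dev/shm","type":"tmpfs","options":["nosuid","size=65536k"]}
	]}`)
	missing, err := mountsWithoutSize(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(missing)
}
```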

@kad
Member

kad commented Apr 7, 2025

@marquiz In the original description, what does "unmodified resource" mean? Specifically, does it mean "unmodified" relative to the originating specification, or resources that are not modified over the lifetime of a container / pod?

That was more referring to the information in pod.spec.resources.{requests,limits} for each individual container, without any assumed summarisation or implementation-specific calculation, such as requests.memory being converted to oom_score.

But it is a good catch about ephemeral storage. This is not covered properly in CRI, and I think its type is not present in CRI either, if my memory serves me well.

@marquiz
Contributor Author

marquiz commented Apr 8, 2025

Yes I can confirm, we noticed Kubelet doesn't pass the 'size' mount option of the ephemeral volume to CRI.

Thanks for the feedback. I'll update the proposal. Need to think a bit how to incorporate the relevant information in the proposal.

@zvonkok

zvonkok commented Apr 8, 2025

Another use case I am currently working on is proper NUMA support in Kata/CoCo, and memory hot-plugging is impossible with NUMA-enabled VMs. I need to know the resource limits upfront for all things related to more sophisticated VM configurations. @marquiz @kad FYI
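As a rough illustration of sizing a VM up front, the Go sketch below computes a sandbox size from the limits of all containers in a pod. The types and names (`containerLimits`, `vmSize`, `sizeSandbox`) are hypothetical. It assumes init containers run one at a time before the regular containers, so the result is the larger of the summed regular containers and the largest init container; pod overhead and restartable init containers are ignored.

```go
package main

import "fmt"

// containerLimits is a hypothetical, simplified view of one container's
// limits: CPU in millicores, memory in bytes.
type containerLimits struct {
	Name        string
	Init        bool
	CPUMilli    int64
	MemoryBytes int64
}

// vmSize is the hypothetical up-front sizing result.
type vmSize struct {
	CPUMilli    int64
	MemoryBytes int64
}

func maxInt64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

// sizeSandbox sums the limits of regular containers and takes the maximum
// over init containers, then returns the larger of the two per resource.
func sizeSandbox(containers []containerLimits) vmSize {
	var appSum, initMax vmSize
	for _, c := range containers {
		if c.Init {
			initMax.CPUMilli = maxInt64(initMax.CPUMilli, c.CPUMilli)
			initMax.MemoryBytes = maxInt64(initMax.MemoryBytes, c.MemoryBytes)
			continue
		}
		appSum.CPUMilli += c.CPUMilli
		appSum.MemoryBytes += c.MemoryBytes
	}
	return vmSize{
		CPUMilli:    maxInt64(appSum.CPUMilli, initMax.CPUMilli),
		MemoryBytes: maxInt64(appSum.MemoryBytes, initMax.MemoryBytes),
	}
}

func main() {
	const gib = int64(1) << 30
	size := sizeSandbox([]containerLimits{
		{Name: "init", Init: true, CPUMilli: 4000, MemoryBytes: 1 * gib},
		{Name: "app", CPUMilli: 1000, MemoryBytes: 2 * gib},
		{Name: "helper", CPUMilli: 1500, MemoryBytes: 1 * gib},
	})
	fmt.Printf("cpu=%dm memory=%dGiB\n", size.CPUMilli, size.MemoryBytes/gib)
}
```

This calculation is only possible at sandbox creation time if the limits of every container are known then, which is the information this use case asks for.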

Comment on lines +277 to +278
As a cluster administrator, I want to install an NRI plugin that does
customized resource handling. I run kubelet with CPU manager and memory manager

A similar case, which could be a story of its own, stresses the goal of having better visibility into extended resources:

"As a cluster administrator, I want to install an NRI plugin that does limits enforcement for extended resources that kubelet does not know about (e.g., misc cgroup settings). These extended resources are consumed either via the pod spec or indirectly via Pod overhead".

Contributor Author

Thx @mythi, I added this as a separate user story. WDYT?

Co-authored-by: Mikko Ylinen <[email protected]>
@marquiz
Contributor Author

marquiz commented Apr 29, 2025

Yes I can confirm, we noticed Kubelet doesn't pass the 'size' mount option of the ephemeral volume to CRI.

Thanks for the feedback. I'll update the proposal. Need to think a bit how to incorporate the relevant information in the proposal.

@c3d @gsid100 I think I was a bit too optimistic/wishful on this. My feeling is that "volume management" is probably best kept out of the scope of this proposal for now. There is no concept of volumes in CRI. The kubelet prepares the volumes on its own and passes down the host path to be mounted inside the container (in mounts). [The exception is OCI image/artifact volumes, but they're always read-only.] And if we start to talk about parameters of emptyDir volumes, there are probably other ephemeral/non-ephemeral volume types that people would also like to include. Bringing volumes into the discussion would significantly broaden the scope, so I think it's best to have a separate KEP about that.

Thoughts?

@dims
Member

dims commented Apr 29, 2025

cc @tzneal

- update kep.yaml for v1.34
- slight rewording of risks/mitigations
- remove SIDECAR_CONTAINER enum
- align cri-api diff snippets with latest k/k master
- update design details
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander, marquiz, mikebrow
Once this PR has been reviewed and has the lgtm label, please assign dchen1107, wojtek-t for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@marquiz
Contributor Author

marquiz commented May 2, 2025

Updated:

  • added new user story by @mythi
  • update kep.yaml for v1.34
  • slight rewording of risks/mitigations
  • remove SIDECAR_CONTAINER enum
  • align cri-api diff snippets with latest k/k master
  • update design details

resource allocation and thus better interoperability between containers inside
the Pod. The runtime may use the information to make preparations for all
upcoming containers of the pod. E.g. setup all needed resources for a VM-based
pod or prepare for optimal allocation of resources of all the containers of the
Contributor

Can we provide more details around this aspect? It isn't clear to me from reading the KEP how the values will be used for optimal allocation. Maybe an example would help shed some light?

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status