Pass down resources to CRI #4113
base: master
Conversation
marquiz commented on Jun 28, 2023
- One-line PR description: KEP for extending the CRI API to pass down unmodified resource information from the kubelet to the CRI runtime.
- Issue link: Pass down resources to CRI #4112
- Other comments:
Co-authored-by: Antti Kervinen <[email protected]>
@marquiz: GitHub didn't allow me to request PR reviews from the following users: fidencio. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 2;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 3;
should the keys here be a special type instead of unstructured?
I don't think it's possible to have something like type ResourceName string in protobuf. Please correct me if I'm wrong.
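(Side note: protobuf map keys are restricted to integral and string scalar types, so a typed key cannot be declared in the .proto itself. A consumer of the generated Go bindings can still recover the Kubernetes-typed name. A minimal, purely illustrative sketch, not part of the proposed API:)

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Illustrative stand-in for the proposed `requests` map; the generated Go
	// code for a protobuf map<string, Quantity> field uses plain string keys.
	requests := map[string]resource.Quantity{
		"cpu":            resource.MustParse("100m"),
		"vendor.com/xpu": resource.MustParse("1"),
	}

	for name, qty := range requests {
		// The typed key can still be recovered on the consumer side.
		rn := v1.ResourceName(name)
		fmt.Printf("%s => %s\n", rn, qty.String())
	}
}
```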
ping @haircommander, are you satisfied with the reply (close as resolved)?
/retitle Pass down resources to CRI
@marquiz We need to check how this will work with DRA and CDI devices, i.e. whether we have enough information to know which devices need to be added to the sandbox just by the resource claim name.
@marquiz There is already some code for sandbox sizing (accumulation of CPU and memory resources) for reference: kubernetes/kubernetes#104886, which we leverage in Kata. What are the plans for this interface, deprecate it or keep it?
Thanks @marquiz. It would be good to get more concrete details on the use cases that this would enable. There is also the question of complex devices that are managed by device plugins, where there isn't a clear mapping from the resources entry (e.g. vendor.com/xpu: 1) to the resources added to the container, or DRA, where the associated resources.requests.claims entry is not mentioned.
#### Story 3

As a cluster administrator, I want to install an OCI hook/runc wrapper/NRI
Could you expand on this use case? How does extending the CRI translate to modifications in the OCI runtime specification which is interpreted by runc (or wrappers)?
The CRI changes (in this KEP) would not directly translate to anything in the OCI config. They are just "informational" data that a hook/wrapper/plugin can then use to tweak the OCI config, say if you want to do customized CPU pinning in your plugin. I'll put some more flesh on this section...
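To make the CPU pinning example a bit more concrete, here is a rough sketch of the kind of policy such a hook/plugin could implement. This is not the actual NRI or hook API; the ContainerResources type and the pinning policy are hypothetical stand-ins built on the unmodified requests/limits this KEP proposes to pass down.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// ContainerResources is a hypothetical stand-in for the unmodified
// per-container requests/limits a plugin would see with this KEP.
type ContainerResources struct {
	Requests map[string]resource.Quantity
	Limits   map[string]resource.Quantity
}

// pinExclusiveCPUs returns a cpuset string when the container asks for
// exclusive CPUs (integer CPU request equal to its limit), mimicking a
// custom pinning policy applied outside kubelet's static CPU manager.
func pinExclusiveCPUs(res ContainerResources, freeCPUs []int) (string, bool) {
	req, ok := res.Requests["cpu"]
	if !ok || req.Value()*1000 != req.MilliValue() {
		return "", false // no request or a fractional one: no exclusive pinning
	}
	lim, ok := res.Limits["cpu"]
	if !ok || lim.Cmp(req) != 0 {
		return "", false // requests != limits: not a guaranteed allocation
	}
	n := int(req.Value())
	if n > len(freeCPUs) {
		return "", false // not enough free CPUs left on this node
	}
	cpus := ""
	for i := 0; i < n; i++ {
		if i > 0 {
			cpus += ","
		}
		cpus += fmt.Sprintf("%d", freeCPUs[i])
	}
	// The plugin would then write this into the OCI spec's
	// linux.resources.cpu.cpus field for the container.
	return cpus, true
}

func main() {
	res := ContainerResources{
		Requests: map[string]resource.Quantity{"cpu": resource.MustParse("2")},
		Limits:   map[string]resource.Quantity{"cpu": resource.MustParse("2")},
	}
	if cpus, ok := pinExclusiveCPUs(res, []int{4, 5, 6, 7}); ok {
		fmt.Println("pinning container to CPUs", cpus)
	}
}
```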
@elezar I updated Story 3, PTAL
Thanks.
Resolved?
ping @elezar may I close this as resolved?
requests:
  cpu: 100m
  memory: 100M
  vendor.com/xpu: 1
For clarification: This does not indicate the properties of the resource that was actually allocated for a container requesting one of these devices?
That's very much true. I think I'll add a note about this in the KEP somewhere
@elezar I added a note about device plugin resources after this example. WDYT?
ping @elezar may I close this as resolved?
WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
@MikeZappa87 since you recently shared something along these lines for the networking capabilities, this KEP is also meant to interface with NRI.
Another point to consider is how (or whether) we're going to integrate these enhancements with the new containerd Sandbox API.
@zvonkok that one covers just the native resources and gives them in an "obfuscated" form, i.e. not telling the actual requests/limits (plus it's for Linux resources only). I think we wouldn't, or even couldn't, touch this, i.e. we'd keep it.
Just for reference, linking this old KEP here: #3080. It looks like there were some comments that CDI may not be the complete solution for some use cases. We do not want to break anything by relying only on CDI when passing down the devices.
Hi folks, any updates on this KEP? Any ideas to help move it forward?
- update kep.yaml to target v1.33
- update references to kep-1287 and kep-2837
Updated:
The related changes in other referenced KEPs (#1287, #2837) are now merged, which hopefully makes this a slightly easier read. Those changes/KEPs (for Kubernetes v1.32) are still mentioned in the corresponding sections.
Including the KubernetesResources in the ContainerConfig message serves
multiple purposes:

1. Catch changes that happen between pod sandbox creation and container
In-place resize should be writing a message to the runtime that a pod or container has changed resources, right @tallclair? I think that'd be better than having a runtime interpret diffs in container values...
Does it do so when the container hasn't yet been created? IIRC the thinking behind this case was the Kata/CoCo use cases. The runtime or NRI plugin (or whatever) CAN detect changes if they need/want to.
Let's validate this flow. I don't think letting the runtime infer this from diffs is correct.
@SergeyKanzhelev Are all your questions answered, and are you happy with the current shape of the KEP? If yes, would you mind removing the
Previous PRR still looks good.
@SergeyKanzhelev PTAL
any information about the resources (such as native resources, host devices,
mounts, CDI devices etc) that will be required by the application. The CRI
runtime only becomes aware of the resources piece by piece when containers of
the pod are created (one-by-one).
Which of this information must come from the Kubelet, and which can be read from the k8s API? Resource requests & limits and volumes can all be read directly from the pod object. Do they need to be plumbed through the CRI, or can you just read from the k8s API instead?
The idea is to have a clear separation between layers, with a normal top-to-bottom information flow between components. The kubelet knows the current state of the object, so it provides the whole state of that object down to the lower layers. This allows the lower layers to use the information immediately in a consistent, transactional way, and it avoids delaying the reply to a CRI gRPC call because another gRPC call to an outside service is needed. We want to eliminate scenarios where different projects use backdoors or out-of-band communication to fetch the information they need. Examples are CNI providers that, based on the sandbox ID, connect to the k8s API server to get the whole pod object, or similarly, CNI providers that connect back to the kubelet in a "hey kubelet, you sent me a CRI message but didn't provide enough information; I know your PodResources API has that information for this pod, so let me fetch it from there" manner.
So, the whole idea is to have a clean separation between layers:
- The kubelet is responsible for getting the higher-level object from the API server, preparing it to run, and handing it to the runtime in a fully descriptive way.
- Runtimes should not be making calls from the bottom layers back to the upper layers (kubelet, API server) to get missing pieces.
Adding to that, for mounts we want more/different information than what is available at the Kubernetes API level, at least the host_path that will be mounted into the container.
In addition to the other points, in Kata/Confidential Containers we cannot handle subPath in volumeMounts because subPath is handled completely by the kubelet. https://github.com/kubernetes/kubernetes/blob/d7774fce9a7fcec890d7c0beffacd6ae34152b01/pkg/volume/util/subpath/subpath_linux.go#L175
@marquiz In the original description, what does "unmodified resource" mean? Specifically, does it mean "unmodified" relative to the originating specification, or resources that are not modified over the lifetime of a container / pod? In the context of Kata Containers, there is some missing information that the runtime is not receiving and that is relevant because you need to be able to pass it down to a VM. That includes, for example, size limits for ephemeral storage. CC @gsid100.
Yes, I can confirm: we noticed the kubelet doesn't pass the 'size' mount option of the ephemeral volume to CRI. For example, applying a spec with a size-limited ephemeral volume results in a config.json that does not carry the size option.
That was more referring to information about ... But it is a good catch about ephemeral storage. This is not covered properly in CRI, and I think the type of it is also not present in CRI, if my memory serves me well.
Thanks for the feedback. I'll update the proposal. I need to think a bit about how to incorporate the relevant information in the proposal.
As a cluster administrator, I want to install an NRI plugin that does
customized resource handling. I run kubelet with CPU manager and memory manager
A similar case, but one that could be a story of its own, stresses the goal of having better visibility into extended resources: "As a cluster administrator, I want to install an NRI plugin that does limits enforcement for extended resources that kubelet does not know about (e.g., misc cgroup settings). These extended resources are consumed either via the pod spec or indirectly via Pod overhead."
Thx @mythi added this as a separate user story. WDYT?
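To illustrate the extended-resource story suggested above, here is a rough sketch of what such enforcement might look like. The vendor.com/epc resource name, the mapping to the sgx_epc misc cgroup resource, and the cgroup path are all hypothetical examples, not something defined by this KEP.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"k8s.io/apimachinery/pkg/api/resource"
)

// enforceMiscLimit writes a cgroup v2 misc controller limit for a container
// whose (hypothetical) extended resource request kubelet itself does not act on.
func enforceMiscLimit(cgroupDir string, requests map[string]resource.Quantity) error {
	qty, ok := requests["vendor.com/epc"]
	if !ok {
		return nil // nothing to enforce for this container
	}
	// misc.max takes "<resource> <max>" lines; "sgx_epc" is only an example
	// of a resource registered with the misc controller.
	entry := fmt.Sprintf("sgx_epc %d\n", qty.Value())
	return os.WriteFile(filepath.Join(cgroupDir, "misc.max"), []byte(entry), 0o644)
}

func main() {
	reqs := map[string]resource.Quantity{
		"vendor.com/epc": resource.MustParse("64Mi"),
	}
	if err := enforceMiscLimit("/sys/fs/cgroup/example.slice/example.scope", reqs); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```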
Co-authored-by: Mikko Ylinen <[email protected]>
@c3d @gsid100 I think I was a bit too optimistic/wishful on this. My feeling is that the "volume management" is probably best kept out of the scope of this proposal for now. There is no concept of volumes in CRI. Kubelet prepares the volumes on its own and passes down the path (on the host system) that is to be mounted inside the container (in ...). Thoughts?
cc @tzneal
- update kep.yaml for v1.34
- slight rewording of risks/mitigations
- remove SIDECAR_CONTAINER enum
- align cri-api diff snippets with latest k/k master
- update design details
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: haircommander, marquiz, mikebrow. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Updated:
resource allocation and thus better interoperability between containers inside
the Pod. The runtime may use the information to make preparations for all
upcoming containers of the pod. E.g. setup all needed resources for a VM-based
pod or prepare for optimal allocation of resources of all the containers of the
Can we provide more details around this aspect? It isn't clear to me from reading the KEP how the values will be used for optimal allocation. Maybe an example would help shed some light?
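One possible way to picture the VM-based case asked about here (purely illustrative, with hypothetical stand-in types for the proposed pod-level resource information): the runtime sums the per-container requests it now sees at RunPodSandbox time and sizes the VM up front, instead of hot-plugging resources as each container is created.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// ContainerResources mirrors, as a hypothetical stand-in, the per-container
// requests proposed to be carried in the pod sandbox config.
type ContainerResources struct {
	Name     string
	Requests map[string]resource.Quantity
}

// vmSize computes an initial vCPU and memory size for a VM-based pod sandbox
// from the requests of all containers known at sandbox creation time.
func vmSize(containers []ContainerResources) (vcpus, memBytes int64) {
	cpu := resource.NewMilliQuantity(0, resource.DecimalSI)
	mem := resource.NewQuantity(0, resource.BinarySI)
	for _, c := range containers {
		if q, ok := c.Requests["cpu"]; ok {
			cpu.Add(q)
		}
		if q, ok := c.Requests["memory"]; ok {
			mem.Add(q)
		}
	}
	// Round vCPUs up to whole cores; a real runtime would also account for
	// pod overhead and its own minimum VM size.
	vcpus = (cpu.MilliValue() + 999) / 1000
	return vcpus, mem.Value()
}

func main() {
	containers := []ContainerResources{
		{Name: "app", Requests: map[string]resource.Quantity{
			"cpu": resource.MustParse("1500m"), "memory": resource.MustParse("1Gi")}},
		{Name: "sidecar", Requests: map[string]resource.Quantity{
			"cpu": resource.MustParse("250m"), "memory": resource.MustParse("256Mi")}},
	}
	vcpus, mem := vmSize(containers)
	fmt.Printf("boot the VM with %d vCPUs and %d bytes of memory\n", vcpus, mem)
}
```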