
KEP 1287: Instrumentation for in-place pod resize #5340


Open

wants to merge 1 commit into base: master

Conversation

natasha41575 (Contributor)

  • One-line PR description:
    • Add section with details about the metrics we plan to instrument for IPPR

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 23, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: natasha41575
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels May 23, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 23, 2025
Labels:
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, we increment the counter multiple times, once for each. This means that a single pod update changing multiple resource types will be considered multiple requests for this metric.
natasha41575 (Contributor Author)

I guess this is a little weird, but I'm not sure the alternatives are better. We also already have `apiserver_request_total{resource=pods,subresource=resize}` if we just want the raw total number of resize requests to the API server.

Member

Yeah, I agree. Maybe worth asking the sig-instrumentation folks for advice?

- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, we increment the counter multiple times, once for each. This means that a single pod update changing multiple resource types will be considered multiple requests for this metric.
- `operation_type` - the kind of change the resize makes to the resource. Possible values: `increase`, `decrease`, `add`, or `remove`.
natasha41575 (Contributor Author)

I'm assuming requests/limits can be added to a container, but I don't actually know if that's true? (I know kubernetes/kubernetes#127143 is adding support to remove them.)

Member

Yeah, they can be added, except memory limits, because adding a memory limit would count as a memory limit decrease (which we don't currently allow), but we'll lift that restriction.
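To make the label semantics above concrete, here is a minimal Go sketch of a counter carrying `resource_type` and `operation_type` labels, incremented once per resource type that changes and classifying each change as `increase`, `decrease`, `add`, or `remove`. The metric name, the use of the plain Prometheus client, and the rule of treating an unset value as zero are illustrative assumptions; the real kubelet metric would go through its normal metrics registration.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Illustrative counter mirroring the proposed resource_type / operation_type
// labels. The name is hypothetical, not taken from the KEP.
var resizeRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "resize_requests_example_total",
		Help: "Resize requests, broken out by resource type and operation.",
	},
	[]string{"resource_type", "operation_type"},
)

// classifyOp maps a single resource value transition to an operation_type.
// Values are in the resource's base units; 0 is assumed to mean "not set".
func classifyOp(oldVal, newVal int64) string {
	switch {
	case oldVal == 0 && newVal != 0:
		return "add"
	case oldVal != 0 && newVal == 0:
		return "remove"
	case newVal > oldVal:
		return "increase"
	default:
		return "decrease"
	}
}

// recordResize increments the counter once per resource type that changed, so
// one pod update touching cpu_requests and memory_limits counts twice.
func recordResize(changes map[string][2]int64) {
	for resourceType, v := range changes {
		oldVal, newVal := v[0], v[1]
		if oldVal == newVal {
			continue
		}
		resizeRequests.WithLabelValues(resourceType, classifyOp(oldVal, newVal)).Inc()
	}
}

func main() {
	prometheus.MustRegister(resizeRequests)
	recordResize(map[string][2]int64{
		"cpu_requests":  {500, 1000},    // increase
		"memory_limits": {0, 268435456}, // add
	})
}
```

Under this sketch, a single pod update that raises `cpu_requests` and adds a `memory_limits` value increments the counter twice, matching the "once for each" rule quoted above.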


#### `kubelet_pod_resize_requests_total`

This metric tracks the total number of resize requests observed by the Kubelet, counted at the pod level.
natasha41575 (Contributor Author)

cc @ndixita

I don't have all the context, but we might want to revisit or reuse this metric in the context of pod-level resources once resizing pod-level resources is supported.

Member

Without pod-level resources, is this just based on the net change across all containers? What happens if 2 containers are resized, but the net change is 0?

I'm wondering whether we should skip this metric for now, and only record it in the context of pod-level resources?
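For discussion, a small sketch of one possible reading of the pod-level counter, continuing the hypothetical file above: a pod update counts once if any container's desired resources differ from the currently allocated ones, even when the per-container changes cancel out in aggregate. This is only one way to resolve the net-change question raised here, not something the KEP text above settles.

```go
// podResizeObserved reports whether a pod update counts as one pod-level
// resize under this reading. The maps hold a single illustrative resource
// value per container name; in practice the comparison would cover all
// resources of all containers.
func podResizeObserved(desired, allocated map[string]int64) bool {
	for container, want := range desired {
		if allocated[container] != want {
			return true // count the pod once, not once per container
		}
	}
	return false
}
```

The caller would increment `kubelet_pod_resize_requests_total` exactly once when this returns true.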

@natasha41575 (Contributor Author)

/assign @tallclair


Comment on lines +928 to +930
- `resource_type` - what type of resource is being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types is changing in the resize request, we increment the counter multiple times, once for each.
- `operation_type` - the kind of change the resize makes to the resource. Possible values: `increase`, `decrease`, `add`, or `remove`.
Member

This seems like useful data, but I think including these dimensions will require tracking some historical context for the pod resources? At the time we deem the resize complete, we don't currently know what changed. WDYT?
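One hypothetical way to supply that historical context, continuing the sketch above: snapshot the allocated values when a resize is accepted, then diff against the actuated values when it completes. The map shapes and keying by pod UID are illustrative, not part of the KEP.

```go
// inFlightResizes remembers, per pod UID, the allocated resource values at the
// moment a resize was accepted, so the completion path can tell what changed.
var inFlightResizes = map[string]map[string]int64{}

func onResizeAccepted(podUID string, allocated map[string]int64) {
	snap := make(map[string]int64, len(allocated))
	for resourceType, v := range allocated {
		snap[resourceType] = v
	}
	inFlightResizes[podUID] = snap
}

// onResizeCompleted returns the resource types whose actuated values differ
// from the snapshot; these would feed the resource_type label when the
// completion metric is recorded.
func onResizeCompleted(podUID string, actuated map[string]int64) []string {
	snap, ok := inFlightResizes[podUID]
	if !ok {
		return nil
	}
	delete(inFlightResizes, podUID)
	var changed []string
	for resourceType, v := range actuated {
		if snap[resourceType] != v {
			changed = append(changed, resourceType)
		}
	}
	return changed
}
```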


This metric is recorded as a gauge.

#### `kubelet_pod_infeasible_resize_total`
Member

Should we count deferred resizes too?

Comment on lines +954 to +955
This metric tracks the total number of resize requests that the Kubelet originally marked as deferred but
later accepted. This metric primarily exists because if a deferred resize is accepted through the timed retry as
Member

Do we have sufficient information to distinguish between a deferred resize that was accepted, and a deferred resize that was overwritten with a new feasible size?

resizes that we should fix.

Labels:
- `retry_reason` - whether the resize was accepted through the timed retry or explicitly signaled. Possible values: `timed`, `signaled`.
Member

What does this mean?
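A hedged sketch of how the `retry_reason` label might be recorded, continuing the earlier sketch: `timed` for a retry fired by the periodic requeue and `signaled` for a retry triggered explicitly. The metric name and the exact trigger behind the `signaled` path are assumptions, since the excerpt above does not spell them out.

```go
// Illustrative counter for deferred resizes that were later accepted, labeled
// by what triggered the successful retry. The name is hypothetical.
var deferredAcceptedResizes = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "deferred_resizes_accepted_example_total",
		Help: "Deferred resizes later accepted, by retry trigger.",
	},
	[]string{"retry_reason"},
)

func onDeferredResizeAccepted(triggeredByTimer bool) {
	reason := "signaled"
	if triggeredByTimer {
		reason = "timed"
	}
	deferredAcceptedResizes.WithLabelValues(reason).Inc()
}
```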
