Target GA for recover from Volume Expansion failure feature #5353

gnufied · 2025-05-29T22:32:52Z

k8s-ci-robot · 2025-05-29T22:51:21Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
Once this PR has been reviewed and has the lgtm label, please assign jpbetz, xing-yang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

keps/prod-readiness/OWNERS
keps/sig-storage/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jsafrane · 2025-05-30T12:54:04Z

keps/sig-storage/1790-recover-resize-failure/kep.yaml

@@ -22,11 +22,12 @@ replaces:
 superseded-by:

 latest-milestone: "v1.32"
-stage: "alpha"


(starting a new "thread") I would also expect https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1790-recover-resize-failure#graduation-criteria to include GA graduation criteria.

jsafrane · 2025-05-30T12:55:06Z

keps/sig-storage/1790-recover-resize-failure/kep.yaml

@@ -22,11 +22,12 @@ replaces:
 superseded-by:

 latest-milestone: "v1.32"
-stage: "alpha"


(starting another unrelated "thread")
Should we mention the new annotation added in kubernetes/kubernetes#131907? The enhancement itself does not target RWX volumes, but an annotation is an API, and an API deserves a note somewhere. I guess that can be resolved in the annotation "API review".

kannon92 · 2025-05-30T18:39:42Z

Shadowing @deads2k on this. One item jumped out at me while reading the PRR.

Was this metric added?

> Are there any missing metrics that would be useful to have to improve observability of this feature?
> We are planning to add new counter metrics that will record success and failure of recovery operations. In cases where recovery fails, the counter will keep increasing until an admin action resolves the error.
>
> Tentative name of metric is `operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}`.
>
> The reason for using the PV name as a label is that we do not expect this feature to be used in a cluster very often, and hence it should be okay to use the names of PVs that were recovered this way.

kannon92 · 2025-05-30T18:44:30Z

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

The answers were a bit vague so we should revisit that section and comment on the scalability questions with more information.

gnufied · 2025-05-30T18:51:19Z

> Was this metric added?

No, we didn't add the metric. I was just discussing this with storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

The metric item was tentative, so I'm not sure it is a blocker for GA.

gnufied · 2025-05-30T19:12:06Z

> It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

I am adding some more information for the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially Yes" because the answer depends on whether a recovery was attempted.

kannon92 · 2025-05-30T19:15:43Z

> > Was this metric added?
>
> No, we didn't add the metric. I was just discussing this with storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.
>
> The metric item was tentative, so I'm not sure it is a blocker for GA.

It would be worth cleaning up the KEP then. I think the PRR mentions using that metric in a few places.

kannon92 · 2025-05-30T19:16:34Z

> > It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.
>
> I am adding some more information for the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially Yes" because the answer depends on whether a recovery was attempted.

The "Potentially Yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

Maybe you can expand on what you mean by recovery.

deads2k · 2025-06-02T12:57:30Z

> No, we didn't add the metric. I was just discussing this with storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

What alternative method do we recommend to a cluster admin for determining when resize operations are failing on a particular node and/or across all nodes?

gnufied · 2025-06-03T16:46:43Z

> What alternative method do we recommend to a cluster admin for determining when resize operations are failing on a particular node and/or across all nodes?

There are csi_sidecar_operations_seconds{method_name="/csi.v1.Controller/ControllerExpandVolume"}, csi_operations_seconds{method_name="/csi.v1.Node/NodeExpandVolume"}, and storage_operation_duration_seconds{operation_name="volume_fs_resize"}, which already track these operations on both the controller and node side. The csi_xxx metrics record gRPC error codes, so if either controller-side or node-side expansion is failing, the failure is recorded and a metric is emitted.

Some of these metrics have evolved since the KEP was originally written (it is now 5 years old). I will update the KEP with the newer metric names.
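
As a rough illustration of how the csi_xxx metrics above could answer the "is expansion failing?" question, here is a minimal Python sketch that computes a per-method failure ratio from already-scraped histogram `_count` samples. The sample values are made up, the `failure_ratio` helper is hypothetical, and the label name `grpc_status_code` is an assumption that should be checked against the actual `/metrics` output of the driver and sidecar in use.

```python
from collections import defaultdict

# Hypothetical scraped samples: (series name, labels, cumulative count).
# ASSUMPTION: the gRPC status is exposed in a label called "grpc_status_code".
SAMPLES = [
    ("csi_sidecar_operations_seconds_count",
     {"method_name": "/csi.v1.Controller/ControllerExpandVolume",
      "grpc_status_code": "OK"}, 40.0),
    ("csi_sidecar_operations_seconds_count",
     {"method_name": "/csi.v1.Controller/ControllerExpandVolume",
      "grpc_status_code": "Internal"}, 10.0),
    ("csi_operations_seconds_count",
     {"method_name": "/csi.v1.Node/NodeExpandVolume",
      "grpc_status_code": "OK"}, 30.0),
]


def failure_ratio(samples, status_label="grpc_status_code"):
    """Return {method_name: failed_calls / total_calls} for the given samples."""
    totals = defaultdict(float)
    failures = defaultdict(float)
    for _series, labels, value in samples:
        method = labels.get("method_name")
        status = labels.get(status_label)
        if method is None or status is None:
            continue  # skip series that do not carry both labels
        totals[method] += value
        if status != "OK":
            failures[method] += value
    return {m: failures[m] / totals[m] for m in totals if totals[m] > 0}


if __name__ == "__main__":
    for method, ratio in failure_ratio(SAMPLES).items():
        print(f"{method}: {ratio:.0%} of calls failed")
```

Because these are cumulative counters, a real alert would compare two scrapes (or use a rate over a time window) rather than the lifetime totals used here.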

gnufied · 2025-06-03T16:59:43Z

keps/sig-storage/1790-recover-resize-failure/README.md

@@ -140,6 +142,17 @@ This will allow external-resizer to recover safely from node expansion failures

 ![New flow kubelet](./Expanding volume - Kubelet Loop.png)

+### Handling of RWX volumes that don't require node expansion


@jsafrane, this is the section about the annotation.

gnufied · 2025-06-03T17:00:53Z

> The "Potentially Yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

I have reworded those, ptal.

k8s-ci-robot added the cncf-cla: yes label (the PR's author has signed the CNCF CLA) May 29, 2025

k8s-ci-robot requested review from saad-ali and xing-yang May 29, 2025 22:32

k8s-ci-robot added the kind/kep, sig/storage, and size/XS (0-9 changed lines, ignoring generated files) labels May 29, 2025

gnufied mentioned this pull request May 29, 2025

Support recovery from volume expansion failure #1790

Open

12 tasks

gnufied force-pushed the bump-recovery-ga branch 2 times, most recently from c94f4b5 to bd6bae1 May 29, 2025 22:39

Target GA for recover from Volume Expansion failure feature

d52c7d4

gnufied force-pushed the bump-recovery-ga branch from bd6bae1 to d52c7d4 May 29, 2025 22:51

jsafrane reviewed May 30, 2025

Address comments about GA milestone

316f5aa

k8s-ci-robot added the size/M label (30-99 changed lines, ignoring generated files) and removed the size/XS label (0-9 changed lines) Jun 2, 2025

Add update for annotation API

698cd3d

Update metric names

fc95a65
