Target GA for recover from Volume Expansion failure feature #5353

Open: gnufied wants to merge 4 commits into master from bump-recovery-ga

gnufied
Member

@gnufied gnufied commented May 29, 2025

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 29, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 29, 2025
@gnufied gnufied force-pushed the bump-recovery-ga branch 2 times, most recently from c94f4b5 to bd6bae1 Compare May 29, 2025 22:39
@gnufied gnufied force-pushed the bump-recovery-ga branch from bd6bae1 to d52c7d4 Compare May 29, 2025 22:51
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
Once this PR has been reviewed and has the lgtm label, please assign jpbetz, xing-yang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -22,11 +22,12 @@ replaces:
superseded-by:

latest-milestone: "v1.32"
stage: "alpha"

(starting another unrelated "thread")
Should we mention the new annotation added in kubernetes/kubernetes#131907? The enhancement itself does not target RWX volumes; still, an API (an annotation is an API) deserves a note somewhere. I guess that can be resolved in the annotation "API review".

@kannon92
Contributor

Shadowing @deads2k on this. One item jumped out at me while reading the PRR.

Was this metric added?

> Are there any missing metrics that would be useful to have to improve observability of this feature?
> We are planning to add new counter metrics that will record success and failure of recovery operations. In cases where recovery fails, the counter will forever be increasing until an admin action resolves the error.
>
> Tentative name of metric is - operation_operation_volume_recovery_total{state='success', volume_name='pvc-abce'}
>
> The reason for using PV name as a label is - we do not expect this feature to be used in a cluster very often and hence it should be okay to use name of PVs that were recovered this way.

@kannon92
Contributor

It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

The answers were a bit vague so we should revisit that section and comment on the scalability questions with more information.

@gnufied
Member Author

gnufied commented May 30, 2025

> Was this metric added?

No, we didn't add the metric. I was just discussing this with the storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

The metric item was tentative, so I'm not sure it is a blocker for GA.

@gnufied
Member Author

gnufied commented May 30, 2025

> It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.

I am adding some more information to the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially Yes" because the answer depends on whether a recovery was attempted.

@kannon92
Contributor

> > Was this metric added?
>
> No, we didn't add the metric. I was just discussing this with the storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.
>
> The metric item was tentative, so I'm not sure it is a blocker for GA.

It would be worth cleaning up the KEP then. I think you mention using that metric in a few places in the PRR.

@kannon92
Contributor

> > It would be worth expanding on https://github.com/kubernetes/enhancements/blob/d52c7d43a67a3f337e9c8a1869b820e8204af5dd/keps/sig-storage/1790-recover-resize-failure/README.md#scalability as part of the GA graduation.
>
> I am adding some more information to the GA graduation criteria, but the scalability requirements as such have not changed between beta and GA. Which bits did you find vague? I answered a couple of questions with "Potentially Yes" because the answer depends on whether a recovery was attempted.

The "Potentially Yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

Maybe you can expand on what you mean by recovery.

@deads2k
Contributor

deads2k commented Jun 2, 2025

> No, we didn't add the metric. I was just discussing this with the storage folks, and while the idea sounds nice on paper, there is very little practical benefit from this particular metric. From an operational point of view, I can't see why an admin would be interested in it. There are already other resizing-related metrics, which work well enough.

What is the alternative method we recommend to a cluster admin for determining when resize operations are failing on a particular node and/or failing across all nodes?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 2, 2025
@gnufied
Member Author

gnufied commented Jun 3, 2025

> What is the alternative method we recommend to a cluster admin for determining when resize operations are failing on a particular node and/or failing across all nodes?

So there are csi_sidecar_operations_seconds{method_name="/csi.v1.Controller/ControllerExpandVolume"}, csi_operations_seconds{method_name="/csi.v1.Node/NodeExpandVolume"} and storage_operation_duration_seconds{operation_name="volume_fs_resize"}, which already track these operations on both the controller and node side. The csi_xxx metrics track gRPC error codes, so if either controller-side or node-side expansion is failing, it will be recorded appropriately and a metric emitted.

Some of these metrics have evolved since the KEP was originally written (it is now 5 years old), so I will update the KEP with the newer metric names.
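
As a rough illustration of how an admin might use the metrics named above to spot failing expansions, here is a minimal sketch in Python. It scans Prometheus text-format samples and totals the expand calls that ended with a non-OK gRPC status. The sample payload, the driver name, the `grpc_status_code` label name and its values, and the `failed_expansions` helper are all assumptions made for illustration; they are not taken from the KEP or from this thread.

```python
import re
from collections import defaultdict

# Hypothetical scrape output. Label names and values are assumptions.
SAMPLE = '''\
csi_sidecar_operations_seconds_count{driver_name="example.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/ControllerExpandVolume"} 12
csi_sidecar_operations_seconds_count{driver_name="example.csi.k8s.io",grpc_status_code="Internal",method_name="/csi.v1.Controller/ControllerExpandVolume"} 3
csi_operations_seconds_count{driver_name="example.csi.k8s.io",grpc_status_code="ResourceExhausted",method_name="/csi.v1.Node/NodeExpandVolume"} 1
csi_operations_seconds_count{driver_name="example.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeStageVolume"} 40
'''

# The two expansion RPCs mentioned in the comment above.
EXPAND_METHODS = {
    "/csi.v1.Controller/ControllerExpandVolume",
    "/csi.v1.Node/NodeExpandVolume",
}

SAMPLE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def failed_expansions(text):
    """Total histogram _count samples for expand RPCs with a non-OK status."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if not match:
            continue  # skips blank lines and "# HELP" / "# TYPE" comments
        name, raw_labels, value = match.groups()
        if not name.endswith("_count"):
            continue  # only the per-series call counts are of interest here
        labels = dict(LABEL_RE.findall(raw_labels))
        method = labels.get("method_name")
        code = labels.get("grpc_status_code", "OK")
        if method in EXPAND_METHODS and code != "OK":
            totals[(method, code)] += float(value)
    return dict(totals)


if __name__ == "__main__":
    for (method, code), count in sorted(failed_expansions(SAMPLE).items()):
        print(f"{method} {code}: {count:g}")
```

In a real cluster the same filter would more likely be written as an alerting rule in the monitoring system; the sketch only shows which label combination separates failed expansions from successful ones, under the labelling assumed above.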

@@ -140,6 +142,17 @@ This will allow external-resizer to recover safely from node expansion failures

![New flow kubelet](./Expanding volume - Kubelet Loop.png)

### Handling of RWX volumes that don't require node expansion
Member Author

@jsafrane this is the section about the annotation.

@gnufied
Member Author

gnufied commented Jun 3, 2025

> The "Potentially Yes" jumped out at me because it sounded like we have not yet investigated whether this would occur.

I have reworded those, ptal.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.