
KEP 4381: DRA structured parameters: updates, promotion to GA #5333


Open
wants to merge 1 commit into base: master from dra-structured-parameters-ga

Conversation

pohly
Contributor

@pohly pohly commented May 23, 2025

  • One-line PR description: DRA structured parameters: updates, promotion to GA
  • Other comments:

Some of the existing content was a bit stale, for example the v1beta2 API changes were missing. Seamless upgrades were already added in 1.33.

New for 1.34 are the `dra_resource_claims_in_use` metric and the Filter timeout. They are part of filling gaps identified for GA. With those gaps closed, the criteria for GA should be satisfied, so promotion to GA is proposed for 1.34.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 23, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pohly
Once this PR has been reviewed and has the lgtm label, please assign derekwaynecarr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from dchen1107 and mrunalp May 23, 2025 06:30
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 23, 2025
in this KEP.
in this KEP. Its documentation and code describe best practices for
developing a DRA driver. It is used by the
[example DRA driver](https://github.com/kubernetes-sigs/dra-example-driver)
Contributor Author

For reference: we are considering renaming the repository to "dra-driver-example" (kubernetes/org#5597 (comment)).

1.32. Only the `v1` version is served by default. `v1beta1` must remain
supported for encoding/decoding. Other betas remain available
as long as required by the [deprecation policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/)
but need to be enabled explicitly.
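For illustration, a client that should keep working across these version changes can target the served `v1` API directly. A minimal sketch using the standard client-go dynamic client; only the group/version/resource triple is DRA-specific, and it assumes a cluster where `resource.k8s.io/v1` is served:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Standard kubeconfig-based client setup.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// resource.k8s.io/v1 is served by default; the beta versions have to be
	// enabled explicitly (e.g. via --runtime-config on the apiserver).
	gvr := schema.GroupVersionResource{
		Group:    "resource.k8s.io",
		Version:  "v1",
		Resource: "resourceclaims",
	}
	claims, err := client.Resource(gvr).Namespace("default").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, claim := range claims.Items {
		fmt.Println(claim.GetName())
	}
}
```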
Contributor Author

I don't see a good path towards removing v1beta1, which is unfortunate because the conversion is ugly due to the different structure. If anyone has suggestions then I am all ears...

Not a blocker for GA.

Member

What blocks us from removing the v1beta1 API once we have switched the serialization version to v1beta2 (ref kubernetes/kubernetes#129889)?

Contributor Author

Existing, upgraded clusters may have objects stored with v1beta1. There is https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/2330-migrating-api-objects-to-latest-storage-version but it seems stuck at alpha.

depends on the number of requests per ResourceClaim, number of ResourceClaims,
number of published devices in ResourceSlices, and the complexity of the
requests. Other checks besides CEL evaluation also take time (usage checks,
match attributes, etc.).
Contributor Author

This is the important part for GA. I think we want the mitigation with a timeout, but on the other hand that is new functionality.

The default duration is also entirely up for debate.

I left it open if and how the timeout can be disabled. I suppose a timeout <= 0 should mean "no timeout". Do we also want a separate feature gate (beta, on by default)? Modifying a feature gate is easier than modifying the scheduler configuration.
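For context, one way this could surface is as a plugin argument in the scheduler configuration. A hypothetical sketch; the type name, field name, and defaulting are placeholders, not the decided API:

```go
package config

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// DynamicResourcesArgs sketches how the Filter timeout could be exposed
// through KubeSchedulerConfiguration plugin args. Everything here is an
// assumption for illustration.
type DynamicResourcesArgs struct {
	metav1.TypeMeta

	// FilterTimeout bounds a single per-node Filter call.
	// nil would select the built-in default; per the comment above,
	// a duration <= 0 could be defined to mean "no timeout".
	FilterTimeout *metav1.Duration
}
```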

Member

I also do think we need the timeout. As described in kubernetes/kubernetes#131730, there are certain situations where the allocation can end up taking a very long time, which would block scheduling of all pods. The situations where this happens probably aren't common, but also not completely crazy.

Contributor Author

@pohly pohly May 30, 2025

@mortent: what's your opinion about having a separate feature gate?

@sanposhiho: does the proposal look correct to you? We discussed this before on Slack.

I now also have an implementation ready for review: kubernetes/kubernetes#132033

It doesn't have an extra feature gate yet.

Member

Timeout seems like a good approach to me.

If this timeout could cause some pods to become unschedulable, we should probably have a feature gate. However, I don't have a strong opinion on it.

Contributor Author

I'm also leaning towards having a feature gate.


A DRA driver may use
[seamless upgrades](https://github.com/kubernetes/kubernetes/blob/582b421393d0fad2ad4a83feba88977ac4434662/pkg/kubelet/pluginmanager/pluginwatcher/README.md#seamless-upgrade)
to ensure that there is always a running driver instance on a node.
Contributor Author

I decided against copying the content from that document into this KEP. I think having it in one place is enough.


The following metric mirrors the `csi_operations_seconds`:
The following metric mirrors the `csi_operations_seconds`, i.e.
provides information about gRPC calls issued by the kubelet:
Contributor Author

We did not promote any of the metrics from alpha to beta when graduating the feature to beta. I think I had asked SIG Instrumentation about the metrics stability lifecycle and whether it needs to be tied to the feature lifecycle, and I believe the answer was "no", but I can no longer find where that was.

cc @dgrisonnet @richabanker

This makes sense to me: adoption of metrics collection probably trails adoption of the feature itself, so blocking graduation on feedback for metrics makes the chicken-and-egg problem even worse.

Many of our existing metrics are still alpha, so this wouldn't be unusual. But perhaps that is just a (very?!) common oversight?

This is also how it has been handled in other recent features.

Member

We had some discussions in the past to potentially have metric graduation lag one release behind feature graduation for the exact reason you mentioned. I think we left the discussion there and never added a requirement to graduate metrics.

But having most metrics alpha is definitely a problem, and it looks like leaving graduation up to component owners didn't really work. Maybe we should add a new criterion that all metrics should be graduated to beta before a feature can be graduated to GA. Then we could have a job catching metrics that are perma-beta, similar to what apimachinery has for APIs. wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like a reasonable policy.

For DRA, I think we could graduate the existing metrics to beta in 1.34. The new one I'd like to keep in alpha and then graduate later, which is not quite according to that proposed policy, but it's just a proposal at this time, right? 😅

Either way, I'll create tracking issues for the metrics so that we don't forget.

- Metric name: `dra_resource_claims_in_use`
- Description: The number of ResourceClaims that are currently in use on the node, by driver name (`driver_name` label value) and across all drivers (special value `<any>` for `driver_name`). Note that the sum of all by-driver counts is not the total number of in-use ResourceClaims because the same ResourceClaim might use devices from different drivers. Instead, use the count for the `<any>` driver_name.
- Type: Gauge
- Labels: `driver_name`
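As a rough illustration of the `<any>` aggregation described above, a sketch using prometheus/client_golang; the kubelet actually registers its metrics through k8s.io/component-base/metrics, so the mechanics here are illustrative only and the helper is hypothetical:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var resourceClaimsInUse = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "dra_resource_claims_in_use",
		Help: "Number of ResourceClaims currently in use on the node, per driver.",
	},
	[]string{"driver_name"},
)

func init() {
	prometheus.MustRegister(resourceClaimsInUse)
}

// recordClaimsInUse publishes the per-driver counts plus the cross-driver
// total under the special driver_name value "<any>". The separate total is
// needed because one ResourceClaim can use devices from several drivers,
// so summing the per-driver series would overcount.
func recordClaimsInUse(perDriver map[string]int, total int) {
	resourceClaimsInUse.Reset()
	for driver, count := range perDriver {
		resourceClaimsInUse.WithLabelValues(driver).Set(float64(count))
	}
	resourceClaimsInUse.WithLabelValues("<any>").Set(float64(total))
}
```

A consumer would then query the node-wide total as `dra_resource_claims_in_use{driver_name="<any>"}` rather than summing over `driver_name`.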
Contributor Author

The implementation is ready, see kubernetes/kubernetes#131641. I'm holding it for review of the KEP update.

The design and implementation of this metric were reviewed by @richabanker.

It starts out as alpha, which is consistent with the other metrics (see above).

@@ -2706,7 +2565,7 @@ calls happens inside the E2E test.

All tests that don't involve actually running a Pod can become part of
conformance testing. Those tests that run Pods cannot be because CDI support in
runtimes is not required.
runtimes and plugin support in the kubelet are not required for conformance.
Contributor Author

- When zero, remove the driver.

- Testing: An E2E test covers the expected retry mechanism in kubelet when
`NodeUnprepareResources` fails intermittently.
Contributor Author

I think a reference to a separate feature is useful here and does not make that feature a blocker for GA, because it just improves how admins deal with this; it's not required.

Contributor Author

@kubernetes/sig-scheduling-leads: who is going to take over as reviewer and approver?

Should @MaciekPytel still be listed for SIG Autoscaling?


DRA drivers should implement both because support for v1alpha4 might get
removed.
Versions v1beta1 and v1 are supported by kubelet. Both are identical.
DRA drivers should implement both because support for v1beta will get
Member

Should this be v1beta1 (instead of just v1beta)?

Contributor Author

Yes.

@pohly pohly mentioned this pull request May 28, 2025
@pohly pohly force-pushed the dra-structured-parameters-ga branch from 8f98b4b to 90f13dd Compare May 30, 2025 09:17
When started with DRA enabled, the scheduler should check whether DRA is also
enabled in the API server. Without such an explicit check, syncing the informer
caches would fail when the feature gate is enabled but the API group is
disabled. How to implement such a check reliably still needs to be determined.
Contributor Author

@pohly pohly May 30, 2025

We decided against implementing this. Instead the usual policy applies: admins must ensure that a feature is enabled in the apiserver before enabling it elsewhere. This is not specific to this KEP and thus not documented explicitly.

@mrunalp
Contributor

mrunalp commented May 31, 2025

/milestone 1.34
/label lead-opted-in

@k8s-ci-robot
Contributor

@mrunalp: The provided milestone is not valid for this repository. Milestones in this repository: [v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34, v1.35]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.34
/label lead-opted-in

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mrunalp
Contributor

mrunalp commented May 31, 2025

/milestone v1.34

@k8s-ci-robot k8s-ci-robot added this to the v1.34 milestone May 31, 2025
@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label May 31, 2025
@pohly
Contributor Author

pohly commented Jun 3, 2025

/wg device-management

@k8s-ci-robot k8s-ci-robot added the wg/device-management Categorizes an issue or PR as relevant to WG Device Management. label Jun 3, 2025
@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Jun 3, 2025
Comment on lines +2025 to +2032
Therefore the scheduler plugin supports a configurable timeout that is
applied to the entire Filter call for each node. In case of a timeout,
Filter returns Unschedulable. If Filter succeeds for some other node(s),
scheduling continues with those. If Filter fails for all of them,
the Pod is placed in the unschedulable queue. It will get checked again
if changes in e.g. ResourceSlices or ResourceClaims indicate that
another scheduling attempt might succeed. If this fails repeatedly,
exponential backoff slows down future attempts.
Member

Are there corner cases that would cause a pod that is schedulable on some node to be constantly rejected because of a timeout? Let's say Filter on some node X always takes a long time and eventually the pod could be scheduled there, but it won't be because of the timeout. If that could be the case, can you say so explicitly in the KEP?

Member

And it's probably worth adding a sentence to the KEP saying that if neither ResourceSlices nor ResourceClaims change, the pod won't be retried.

Contributor Author

Are there corner cases that would cause a pod that is schedulable on some node to be constantly rejected because of a timeout?

Yes, that's the risk here. The default timeout was chosen very high to make this unlikely, but it could still happen. I'll add some more words around this.

if neither ResourceSlices nor ResourceClaims change, the pod won't be retried.

Will add.
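To make the timeout/retry interplay concrete, here is a minimal sketch of the pattern under discussion, assuming a hypothetical `filterWithTimeout` helper and `allocate` callback for the per-node work; the real plugin code differs:

```go
package dynamicresources

import (
	"context"
	"errors"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// filterWithTimeout runs the per-node allocation check under a deadline.
// Hitting the deadline marks only this node as unschedulable for the pod;
// if every node fails, the pod goes to the unschedulable queue and is
// retried (with exponential backoff) when ResourceSlices or ResourceClaims
// change.
func filterWithTimeout(ctx context.Context, timeout time.Duration, pod *v1.Pod, nodeName string,
	allocate func(ctx context.Context) error) *framework.Status {
	if timeout > 0 {
		// Per the discussion above, a timeout <= 0 could mean "no timeout".
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, timeout)
		defer cancel()
	}
	switch err := allocate(ctx); {
	case errors.Is(err, context.DeadlineExceeded):
		// Unschedulable, not an error: Filter may still succeed on other nodes.
		return framework.NewStatus(framework.Unschedulable,
			"timed out trying to allocate devices for pod "+pod.Name+" on node "+nodeName)
	case err != nil:
		return framework.AsStatus(err)
	default:
		return nil
	}
}
```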

