KEP 4381: DRA structured parameters: updates, promotion to GA #5333
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
in this KEP.
in this KEP. Its documentation and code describe best practices for
developing a DRA driver. It is used by the
[example DRA driver](https://github.com/kubernetes-sigs/dra-example-driver)
For reference: we are considering renaming the repository to "dra-driver-example" (kubernetes/org#5597 (comment)).
1.32. Only the `v1` version is served by default. `v1beta1` must remain
supported for encoding/decoding. Other betas remain available
as long as required by the [deprecation policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/)
but need to be enabled explicitly.
I don't see a good path towards removing `v1beta1`, which is unfortunate because the conversion is ugly due to the different structure. If anyone has suggestions then I am all ears...
Not a blocker for GA.
What blocks us from removing the `v1beta1` API once we have switched the serialization version to `v1beta2` (ref kubernetes/kubernetes#129889)?
Existing, upgraded clusters may have objects stored with v1beta1. There is https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/2330-migrating-api-objects-to-latest-storage-version but it seems stuck at alpha.
depends on the number of requests per ResourceClaim, number of ResourceClaims,
number of published devices in ResourceSlices, and the complexity of the
requests. Other checks besides CEL evaluation also take time (usage checks,
match attributes, etc.).
This is the important part for GA. I think we want the mitigation with a timeout, but on the other hand that is new functionality.
The default duration is also entirely up for debate.
I left open whether and how the timeout can be disabled. I suppose a timeout <= 0 should mean "no timeout". Do we also want a separate feature gate (beta, on by default)? Modifying a feature gate is easier than modifying the scheduler configuration.
I also do think we need the timeout. As described in kubernetes/kubernetes#131730, there are certain situations where the allocation can end up taking a very long time, which would block scheduling of all pods. The situations where this happens probably aren't common, but also not completely crazy.
@mortent: what's your opinion about having a separate feature gate?
@sanposhiho: does the proposal look correct to you? We discussed this before on Slack.
I now also have an implementation ready for review: kubernetes/kubernetes#132033
It doesn't have an extra feature gate yet.
Timeout seems like a good approach to me.
If this timeout could cause some pods to become unschedulable, we should probably have a feature gate. However, I don't have a strong opinion on it.
I'm also leaning towards having a feature gate.
A DRA driver may use
[seamless upgrades](https://github.com/kubernetes/kubernetes/blob/582b421393d0fad2ad4a83feba88977ac4434662/pkg/kubelet/pluginmanager/pluginwatcher/README.md#seamless-upgrade)
to ensure that there is always a running driver instance on a node.
I decided against copying the content from that document into this KEP. I think having it in one place is enough.
The following metric mirrors the `csi_operations_seconds`:
The following metric mirrors the `csi_operations_seconds`, i.e.
provides information about gRPC calls issued by the kubelet:
We did not promote any of the metrics from alpha to beta when graduating the feature to beta. I think I had asked SIG Instrumentation about the metrics stability lifecycle and whether it needs to be tied to the feature lifecycle, and I believe the answer was "no" - but I can no longer find where that was.
This makes sense to me: adoption of metrics collection probably trails adoption of the feature itself, so blocking graduation on feedback for metrics would make the chicken-and-egg problem even worse.
Many of our existing metrics are still alpha, so this wouldn't be unusual. But perhaps that is just a (very?!) common oversight?
This is also how it has been handled in other recent features.
We had some discussions in the past about having metric graduation lag one release behind feature graduation, for exactly the reason you mentioned. I think we left the discussion there and never added a requirement to graduate metrics.
But having most metrics alpha is definitely a problem, and it looks like leaving graduation up to component owners didn't really work. Maybe we should add a new criterion that all metrics must be graduated to beta before a feature can be graduated to GA. Then we could have a job catching metrics that are perma-beta, similar to what API Machinery has for APIs. WDYT?
This sounds like a reasonable policy.
For DRA, I think we could graduate the existing metrics to beta in 1.34. The new one I'd like to keep in alpha and then graduate later, which is not quite according to that proposed policy, but it's just a proposal at this time, right? 😅
Either way, I'll create tracking issues for the metrics so that we don't forget.
- Metric name: `dra_resource_claims_in_use`
- Description: The number of ResourceClaims that are currently in use on the node, by driver name (`driver_name` label value) and across all drivers (special value `<any>` for `driver_name`). Note that the sum of all by-driver counts is not the total number of in-use ResourceClaims because the same ResourceClaim might use devices from different drivers. Instead, use the count for the `<any>` driver_name.
- Type: Histogram
- Labels: `driver_name`
The implementation is ready, see kubernetes/kubernetes#131641. I'm holding it for review of the KEP update.
The design and implementation of this metric were reviewed by @richabanker.
It starts out as alpha, which is consistent with the other metrics (see above).
@@ -2706,7 +2565,7 @@ calls happens inside the E2E test.
All tests that don't involve actually running a Pod can become part of
conformance testing. Those tests that run Pods cannot be because CDI support in
runtimes is not required.
runtimes and plugin support in the kubelet are not required for conformance.
For reference: #sig-node discussion around standardizing plugin support.
- When zero, remove the driver.
- Testing: An E2E test covers the expected retry mechanism in kubelet when
  `NodeUnprepareResources` fails intermittently.
I think a reference to a separate feature is useful here and does not make that feature a blocker for GA, because it only improves how admins deal with this; it's not required.
@kubernetes/sig-scheduling-leads: who is going to take over as reviewer and approver?
Should @MaciekPytel still be listed for SIG Autoscaling?
DRA drivers should implement both because support for v1alpha4 might get
removed.
Versions v1beta1 and v1 are supported by kubelet. Both are identical.
DRA drivers should implement both because support for v1beta will get
Should this be `v1beta1` (instead of just `v1beta`)?
Yes.
Some of the existing content was a bit stale, for example the v1beta2 API changes were missing. Seamless upgrades were already added in 1.33. New for 1.34 are the dra_resource_claims_in_use metrics and the Filter timeout. They are part of filling gaps identified for GA. With those gaps closed, the criteria for GA should be satisfied, so promotion to GA gets proposed for 1.34.
Force-pushed from 8f98b4b to 90f13dd.
When started with DRA enabled, the scheduler should check whether DRA is also
enabled in the API server. Without such an explicit check, syncing the informer
caches would fail when the feature gate is enabled but the API group is
disabled. How to implement such a check reliably still needs to be determined.
We decided against implementing this. Instead the usual policy applies: admins must ensure that a feature is enabled in the apiserver before enabling it elsewhere. This is not specific to this KEP and thus not documented explicitly.
/milestone 1.34
@mrunalp: The provided milestone is not valid for this repository.
/milestone v1.34
/wg device-management
Therefore the scheduler plugin supports a configurable timeout that is
applied to the entire Filter call for each node. In case of a timeout,
Filter returns Unschedulable. If Filter succeeds for some other node(s),
scheduling continues with those. If Filter fails for all of them,
the Pod is placed in the unschedulable queue. It will get checked again
if changes in e.g. ResourceSlices or ResourceClaims indicate that
another scheduling attempt might succeed. If this fails repeatedly,
exponential backoff slows down future attempts.
Are there corner cases that would cause a pod that is schedulable on some node to be constantly rejected because of a timeout? Say Filter on some node X always takes a long time and the pod could eventually be scheduled there, but it never will be because of the timeout. If that can happen, can you say so explicitly in the KEP?
And it's probably worth adding a sentence to the KEP saying that if neither ResourceSlices nor ResourceClaims changes, the pod won't be retried.
> Are there corner cases that would cause a pod that is schedulable on some node to be constantly rejected because of a timeout?

Yes, that's the risk here. The default timeout was chosen very high to make this unlikely, but it could still happen. I'll add some more words around this.

> if neither ResourceSlices nor ResourceClaims changes, the pod won't be retried.

Will add.