-
-
Notifications
You must be signed in to change notification settings - Fork 713
Add support for deferred checkpoints #1721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
WalkthroughThis update refines the checkpoint management system across multiple components. In the coordinator, the Checkpointer and TaskCoordinator classes have been adjusted to use improved abort controller handling, configurable delay parameters, and enhanced error logging. The webapp services now include more granular handling of checkpoint reasons and dependency resumption. Database schema changes introduce a new column to track when task dependencies are resumed. These coordinated changes streamline checkpoint operations and dependency management within the system. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client
participant CP as Checkpointer
Client->>CP: checkpointAndPush(opts, delayMs)
CP->>CP: Initialize runAbortControllers with AbortSignal & abort method
alt Delay specified
CP->>CP: Wait for delayMs (log duration)
end
CP->>CP: Invoke #checkpointAndPushWithBackoff(opts, { delayMs, signal })
alt Abort signal triggered
CP-->>Client: Return abort notification
else Operation proceeds
CP-->>Client: Return CheckpointData
end
sequenceDiagram
participant Service as ResumeTaskDependencyService
participant DB as Database
Service->>Service: Call #setDependencyToResumedOnce(dependency)
alt Dependency already resumed
Service-->>Service: Log debug, exit early
else Dependency updated
Service->>DB: Update resumedAt field in TaskRunDependency
Service->>Service: Enqueue messages for resuming task dependency
end
Possibly related PRs
Suggested reviewers
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (8)
apps/webapp/app/v3/services/resumeTaskDependency.server.ts (1)
137-155
: Well-defined setter forresumedAt
, but consider potential race conditions.
TheupdateMany
approach usingid
andresumedAt: null
is effective. Just be mindful of possible concurrency scenarios if multiple calls to the same dependency occur in parallel.apps/coordinator/src/checkpointer.ts (2)
198-200
: Optional delay parameter broadens checkpointing flexibility.
Allowing an optionaldelayMs
parameter is a nice touch. Consider adding a brief code comment explaining typical use cases for the delay to benefit future maintainers.
321-603
: Robust checkpointing process with backoff and cancellation support.
This extensive logic adds configurable delays, exponential backoff, concurrency checks via abort signals, and thorough cleanup. A few considerations:• Concurrency Safety: Re-check the potential for multiple concurrent checkpoints for the same run in extremely high-load scenarios, although your abort controller approach helps mitigate.
• Error Handling Clarity: The fallback to “ERROR” reason is consistent, but confirm logs or downstream code do not require specialized states for “NO_SUPPORT”, “IN_PROGRESS”, etc.
• Maintainability: The method is large; consider splitting into smaller functions if it continues to expand.Overall, this is a solid implementation offering robust checkpoint management.
🧰 Tools
🪛 Biome (1.9.4)
[error] 398-398: Unnecessary continue statement
Unsafe fix: Delete the unnecessary continue statement
(lint/correctness/noUnnecessaryContinue)
[error] 600-600: Unsafe usage of 'return'.
'return' in 'finally' overwrites the control flow statements inside 'try' and 'catch'.
(lint/correctness/noUnsafeFinally)
apps/coordinator/src/index.ts (4)
38-41
: Validate environment variable inputs for checkpoint delays.Using
parseInt
on an empty or malformed string will returnNaN
, which then defaults to0
. While this is acceptable, consider clamping negative or nonsensical values for more robust error handling.
151-181
: Consider applying a typed reason interface inthis.#cancelCheckpoint
calls.Here, a generic reason object is passed. Defining an explicit interface for the
reason
parameter improves code clarity and helps maintain consistency across various cancellation points.
1131-1142
: Document or log the use ofWAIT_FOR_TASK_CHECKPOINT_DELAY_MS
.Applying a configurable delay is beneficial, but consider adding explicit log statements indicating when the delay is applied, to aid in debugging and operational clarity.
1473-1491
: Refine concurrency handling in#cancelCheckpoint
.
- The
reason
parameter is untyped. Defining an interface will streamline logging and maintenance.- Calling
#cancelCheckpoint
multiple times could lead to repeated logs. Consider whether repeated calls are expected and if additional safeguards (e.g., idempotent checks) are needed.internal-packages/database/prisma/schema.prisma (1)
1920-1920
: Consider indexing and documenting usage ofresumedAt
.If you plan to query filtering by
resumedAt
, adding an index can help performance. Also, clarifying the conditions under which it is set will aid maintainability.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
apps/coordinator/src/checkpointer.ts
(10 hunks)apps/coordinator/src/index.ts
(16 hunks)apps/webapp/app/v3/services/createCheckpoint.server.ts
(1 hunks)apps/webapp/app/v3/services/resumeTaskDependency.server.ts
(4 hunks)internal-packages/database/prisma/migrations/20250218124428_add_resumed_at_to_task_run_dependency/migration.sql
(1 hunks)internal-packages/database/prisma/schema.prisma
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: typecheck / typecheck
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (17)
apps/webapp/app/v3/services/resumeTaskDependency.server.ts (3)
6-6
: Import looks good.
No issues found with this newly added import statement.
54-65
: Validate concurrency behavior when skipping duplicate “resume”.
This logic correctly prevents re-resuming an already resumed dependency. However, if multiple processes attempt to resume the same dependency concurrently, consider ensuring that this check remains safe (e.g., via transaction-level isolation if not already guaranteed).
106-117
: Consistent handling for checkpoint vs. non-checkpoint dependencies.
This block mirrors the earlier resume flow, offering uniform coverage. The approach looks robust, and the debug logs help trace attempts that were already resumed.apps/coordinator/src/checkpointer.ts (6)
30-31
: Inconsistent usage of removed reasons.
Although the “DISABLED”, “IN_PROGRESS”, and “NO_SUPPORT” states are no longer in the union, they are still passed to#failCheckpoint
. To improve clarity, unify the handling of these additional states or remove references if they are truly deprecated.
90-98
: Refined approach for tracking aborted checkpoints.
Storing per-run abort controllers with#runAbortControllers
is a neat solution for concurrency control. This should streamline cancellation logic and reduce collision among multiple checkpoint operations.
242-246
: Early check for ongoing checkpoint operations.
Verifying#isRunCheckpointing(runId)
prevents duplicate checkpoint attempts in parallel. This is a vital safeguard to avoid concurrency issues.
260-270
: Accurate run duration metrics.
MeasuringdiffWithoutDelay
ensures insight into actual checkpoint operations, ignoring deliberate wait times. This can help differentiate performance bottlenecks from intentional delays.
287-288
: Helper method clarifies concurrent checkpoint checks.
#isRunCheckpointing
offers a concise way to confirm if the run is already processing a checkpoint, reducing repeated code.
291-319
: Comprehensive cancellation logic.
cancelAllCheckpointsForRun
efficiently aborts the active checkpoint if present, and handles the edge case where a checkpoint already failed. The logging clarity is beneficial for debugging.apps/coordinator/src/index.ts (7)
185-236
: Use consistent shape for cancellation reason objects.Similar to lines 151-181, we recommend extracting a typed interface for the
reason
parameter to standardize the shape and usage across different checkpoint cancellation scenarios.
303-315
: Leverage a typed reason interface for run cancellation events.This code reuses the cancellation logic with an unstructured reason object. Follow the same typed approach for better maintainability.
737-741
: Ensure uniform reason shape for completing tasks.Again, a typed interface for
reason
promotes consistent logging and easier refactoring if the reason object’s fields evolve.
925-929
: Maintain consistency in thereason
parameter structure.Reiterate the suggestion to unify how we pass and document checkpoint cancellation reasons throughout the code.
971-982
: Standardize the cancellation reason for checkpoint aborts.Same suggestion: define a typed reason interface, ensuring consistent fields across every cancellation site.
1033-1042
: Verify the correctness ofattemptNumber
retrieval.
getAttemptNumber()
could returnundefined
. Confirm downstream logic handles that gracefully, or add checks to prevent uninitialized checkpoint data.
1237-1248
: Ensure clarity aroundWAIT_FOR_BATCH_CHECKPOINT_DELAY_MS
.Like the task checkpoint delay, clarify in logs or docs when and why this batch-specific checkpoint delay is triggered.
internal-packages/database/prisma/migrations/20250218124428_add_resumed_at_to_task_run_dependency/migration.sql (1)
1-2
: Confirm UTC handling and default behavior for the newresumedAt
column.Ensure timestamps default to a consistent timezone (likely UTC) and confirm whether a default value or constraints are needed. If this column is queried often, consider indexing it for performance.
These changes will allow us to add an optional and configurable delay to task and batch wait checkpoints. This will reduce unnecessary overheads for dependencies that finish fast. Also includes a fix that prevents checkpoint creation for already resumed dependencies.
Summary by CodeRabbit
New Features
Refactor
Chores