Skip to content

Add support for deferred checkpoints #1721

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 20, 2025
Merged

Conversation

nicktrn
Copy link
Collaborator

@nicktrn nicktrn commented Feb 20, 2025

These changes will allow us to add an optional and configurable delay to task and batch wait checkpoints. This will reduce unnecessary overheads for dependencies that finish fast. Also includes a fix that prevents checkpoint creation for already resumed dependencies.

Summary by CodeRabbit

  • New Features

    • Added a new capability to resume paused operations.
    • Introduced configurable delays for checkpoint processes, enabling smoother task management.
  • Refactor

    • Enhanced overall checkpoint and task dependency handling for improved operational reliability.
    • Strengthened error detection and logging, providing clearer insights during processing.
  • Chores

    • Updated the system’s data tracking to record when tasks or dependencies are resumed.

Copy link

changeset-bot bot commented Feb 20, 2025

⚠️ No Changeset found

Latest commit: 8afead3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

Copy link
Contributor

coderabbitai bot commented Feb 20, 2025

Walkthrough

This update refines the checkpoint management system across multiple components. In the coordinator, the Checkpointer and TaskCoordinator classes have been adjusted to use improved abort controller handling, configurable delay parameters, and enhanced error logging. The webapp services now include more granular handling of checkpoint reasons and dependency resumption. Database schema changes introduce a new column to track when task dependencies are resumed. These coordinated changes streamline checkpoint operations and dependency management within the system.

Changes

File(s) Change Summary
apps/coordinator/src/checkpointer.ts Updated the Checkpointer class: simplified error reasons, replaced #abortControllers with #runAbortControllers (storing both AbortSignal and abort method), added optional delayMs to checkpoint methods, introduced a new unpause method, and improved logging and method signatures.
apps/coordinator/src/index.ts Modified the TaskCoordinator: added new delay constants (WAIT_FOR_TASK_CHECKPOINT_DELAY_MS, WAIT_FOR_BATCH_CHECKPOINT_DELAY_MS), updated the cancel checkpoint method signature to include an optional reason parameter, and integrated these changes into various socket event handlers.
apps/webapp/app/v3/services/createCheckpoint.server.ts Enhanced the CreateCheckpointService: updated switch-case logic to handle new checkpoint reasons ("WAIT_FOR_TASK", "WAIT_FOR_BATCH"), added pre-checks to verify task/batch status, and improved error logging for missing or already resumed runs.
apps/webapp/app/v3/services/resumeTaskDependency.server.ts Improved the ResumeTaskDependencyService: added an import for TaskRunDependency and a new private method #setDependencyToResumedOnce to update a dependency’s resumedAt field before processing checkpoint resumes, preventing redundant resume actions.
internal-packages/database/prisma/migrations/.../migration.sql
internal-packages/database/prisma/schema.prisma
Updated database schema for TaskRunDependency: added a new column/field resumedAt (TIMESTAMP(3)/DateTime?) to track when a task dependency is resumed.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant CP as Checkpointer
    Client->>CP: checkpointAndPush(opts, delayMs)
    CP->>CP: Initialize runAbortControllers with AbortSignal & abort method
    alt Delay specified
        CP->>CP: Wait for delayMs (log duration)
    end
    CP->>CP: Invoke #checkpointAndPushWithBackoff(opts, { delayMs, signal })
    alt Abort signal triggered
        CP-->>Client: Return abort notification
    else Operation proceeds
        CP-->>Client: Return CheckpointData
    end
Loading
sequenceDiagram
    participant Service as ResumeTaskDependencyService
    participant DB as Database
    Service->>Service: Call #setDependencyToResumedOnce(dependency)
    alt Dependency already resumed
        Service-->>Service: Log debug, exit early
    else Dependency updated
        Service->>DB: Update resumedAt field in TaskRunDependency
        Service->>Service: Enqueue messages for resuming task dependency
    end
Loading

Possibly related PRs

Suggested reviewers

  • ericallam

Poem

I'm a rabbit with a nimble code hop,
Leaping through checkpoints that never stop.
Delays and aborts, now managed so neat,
With every change, the system's complete.
Happy hops in the code forest so deep!
🐰💻

✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (8)
apps/webapp/app/v3/services/resumeTaskDependency.server.ts (1)

137-155: Well-defined setter for resumedAt, but consider potential race conditions.
The updateMany approach using id and resumedAt: null is effective. Just be mindful of possible concurrency scenarios if multiple calls to the same dependency occur in parallel.

apps/coordinator/src/checkpointer.ts (2)

198-200: Optional delay parameter broadens checkpointing flexibility.
Allowing an optional delayMs parameter is a nice touch. Consider adding a brief code comment explaining typical use cases for the delay to benefit future maintainers.


321-603: Robust checkpointing process with backoff and cancellation support.
This extensive logic adds configurable delays, exponential backoff, concurrency checks via abort signals, and thorough cleanup. A few considerations:

Concurrency Safety: Re-check the potential for multiple concurrent checkpoints for the same run in extremely high-load scenarios, although your abort controller approach helps mitigate.
Error Handling Clarity: The fallback to “ERROR” reason is consistent, but confirm logs or downstream code do not require specialized states for “NO_SUPPORT”, “IN_PROGRESS”, etc.
Maintainability: The method is large; consider splitting into smaller functions if it continues to expand.

Overall, this is a solid implementation offering robust checkpoint management.

🧰 Tools
🪛 Biome (1.9.4)

[error] 398-398: Unnecessary continue statement

Unsafe fix: Delete the unnecessary continue statement

(lint/correctness/noUnnecessaryContinue)


[error] 600-600: Unsafe usage of 'return'.

'return' in 'finally' overwrites the control flow statements inside 'try' and 'catch'.

(lint/correctness/noUnsafeFinally)

apps/coordinator/src/index.ts (4)

38-41: Validate environment variable inputs for checkpoint delays.

Using parseInt on an empty or malformed string will return NaN, which then defaults to 0. While this is acceptable, consider clamping negative or nonsensical values for more robust error handling.


151-181: Consider applying a typed reason interface in this.#cancelCheckpoint calls.

Here, a generic reason object is passed. Defining an explicit interface for the reason parameter improves code clarity and helps maintain consistency across various cancellation points.


1131-1142: Document or log the use of WAIT_FOR_TASK_CHECKPOINT_DELAY_MS.

Applying a configurable delay is beneficial, but consider adding explicit log statements indicating when the delay is applied, to aid in debugging and operational clarity.


1473-1491: Refine concurrency handling in #cancelCheckpoint.

  1. The reason parameter is untyped. Defining an interface will streamline logging and maintenance.
  2. Calling #cancelCheckpoint multiple times could lead to repeated logs. Consider whether repeated calls are expected and if additional safeguards (e.g., idempotent checks) are needed.
internal-packages/database/prisma/schema.prisma (1)

1920-1920: Consider indexing and documenting usage of resumedAt.

If you plan to query filtering by resumedAt, adding an index can help performance. Also, clarifying the conditions under which it is set will aid maintainability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e19bf2a and 8afead3.

📒 Files selected for processing (6)
  • apps/coordinator/src/checkpointer.ts (10 hunks)
  • apps/coordinator/src/index.ts (16 hunks)
  • apps/webapp/app/v3/services/createCheckpoint.server.ts (1 hunks)
  • apps/webapp/app/v3/services/resumeTaskDependency.server.ts (4 hunks)
  • internal-packages/database/prisma/migrations/20250218124428_add_resumed_at_to_task_run_dependency/migration.sql (1 hunks)
  • internal-packages/database/prisma/schema.prisma (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: units / 🧪 Unit Tests
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (17)
apps/webapp/app/v3/services/resumeTaskDependency.server.ts (3)

6-6: Import looks good.
No issues found with this newly added import statement.


54-65: Validate concurrency behavior when skipping duplicate “resume”.
This logic correctly prevents re-resuming an already resumed dependency. However, if multiple processes attempt to resume the same dependency concurrently, consider ensuring that this check remains safe (e.g., via transaction-level isolation if not already guaranteed).


106-117: Consistent handling for checkpoint vs. non-checkpoint dependencies.
This block mirrors the earlier resume flow, offering uniform coverage. The approach looks robust, and the debug logs help trace attempts that were already resumed.

apps/coordinator/src/checkpointer.ts (6)

30-31: Inconsistent usage of removed reasons.
Although the “DISABLED”, “IN_PROGRESS”, and “NO_SUPPORT” states are no longer in the union, they are still passed to #failCheckpoint. To improve clarity, unify the handling of these additional states or remove references if they are truly deprecated.


90-98: Refined approach for tracking aborted checkpoints.
Storing per-run abort controllers with #runAbortControllers is a neat solution for concurrency control. This should streamline cancellation logic and reduce collision among multiple checkpoint operations.


242-246: Early check for ongoing checkpoint operations.
Verifying #isRunCheckpointing(runId) prevents duplicate checkpoint attempts in parallel. This is a vital safeguard to avoid concurrency issues.


260-270: Accurate run duration metrics.
Measuring diffWithoutDelay ensures insight into actual checkpoint operations, ignoring deliberate wait times. This can help differentiate performance bottlenecks from intentional delays.


287-288: Helper method clarifies concurrent checkpoint checks.
#isRunCheckpointing offers a concise way to confirm if the run is already processing a checkpoint, reducing repeated code.


291-319: Comprehensive cancellation logic.
cancelAllCheckpointsForRun efficiently aborts the active checkpoint if present, and handles the edge case where a checkpoint already failed. The logging clarity is beneficial for debugging.

apps/coordinator/src/index.ts (7)

185-236: Use consistent shape for cancellation reason objects.

Similar to lines 151-181, we recommend extracting a typed interface for the reason parameter to standardize the shape and usage across different checkpoint cancellation scenarios.


303-315: Leverage a typed reason interface for run cancellation events.

This code reuses the cancellation logic with an unstructured reason object. Follow the same typed approach for better maintainability.


737-741: Ensure uniform reason shape for completing tasks.

Again, a typed interface for reason promotes consistent logging and easier refactoring if the reason object’s fields evolve.


925-929: Maintain consistency in the reason parameter structure.

Reiterate the suggestion to unify how we pass and document checkpoint cancellation reasons throughout the code.


971-982: Standardize the cancellation reason for checkpoint aborts.

Same suggestion: define a typed reason interface, ensuring consistent fields across every cancellation site.


1033-1042: Verify the correctness of attemptNumber retrieval.

getAttemptNumber() could return undefined. Confirm downstream logic handles that gracefully, or add checks to prevent uninitialized checkpoint data.


1237-1248: Ensure clarity around WAIT_FOR_BATCH_CHECKPOINT_DELAY_MS.

Like the task checkpoint delay, clarify in logs or docs when and why this batch-specific checkpoint delay is triggered.

internal-packages/database/prisma/migrations/20250218124428_add_resumed_at_to_task_run_dependency/migration.sql (1)

1-2: Confirm UTC handling and default behavior for the new resumedAt column.

Ensure timestamps default to a consistent timezone (likely UTC) and confirm whether a default value or constraints are needed. If this column is queried often, consider indexing it for performance.

@nicktrn nicktrn merged commit 7186b1e into main Feb 20, 2025
11 checks passed
@nicktrn nicktrn deleted the feat/deferred-checkpoints branch February 20, 2025 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant