Skip to content

579 composing cancelled by caller pr #583

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

dan-da
Copy link
Collaborator

@dan-da dan-da commented May 7, 2025

This is another PR where one thing sort of led to another as I got deeper into it, so it is much larger than originally planned/intended. I've squashed the original work-in-progress commits into a set of 6 commits for the primary areas of change. Reviewers might want to check ce1961b which is small and should close #579. Be aware however that there are problems with the tests that are fixed in later commits.


fix: continue composing if job-cancelled error occurs

closes #579

Avoids calling expect() on a function that can fail when a job is
cancelled.

Changes:

  • prove_concensus_program() returns JobHandleError::JobCancelled
    instead of anyhow::bail!("job cancelled by caller")
  • create_block_transaction_from() returns an error when
  • add mine_loop test job_cancel_msg_cancels_composing()
  • add mine_loop test msg_from_main_does_not_crash_composer()

test: add mine_loop tests for job cancellation

adds two tests that cancel a job while composing

Changes:

  • mine() fn accepts a flag to indicate if it should perform initial 60
    second sleep or not. This enables it to be called in unit tests.
  • add mine_loop test job_cancel_msg_cancels_composing()
  • add mine_loop test msg_from_main_does_not_crash_composer()

refactor: proving cancellation and error handling

This enables proving job-cancellation to work in test environment and
cleans up the proving result error path.

Changes:

  • ProverJob handles job cancellation in test environment
  • prove() and prove_concensus_program() now return a CreateProofError
  • remove derive(Serialize, Deserialize) from several error types
    because not all errors can be serialized which creates a tension
    between serialization and having a complete error source chain.
  • change RpcError to use String as error source for most variants.
    RpcError is really the only error-type that needs to be serialized
    and callers do not need the error source chain.
  • fix CreateProofError downcast when checking composer_task error result
  • add tests::shared::wait_until() fn for tests to wait on a condition
  • remove CreateProofError::Failed catch-all for anyhow and replace with
    specific error variants
  • add JobQueue::num_jobs() and num_queued_jobs() methods
  • simplify composer_task error handling in mine_loop

test: fix new mine-loop tests, for parallel usage

The two new mine-loop tests for cancellation were depending on starting
with 0 jobs in the job-queue and no outside job-queue additions during
the test. This is no longer a valid assumption when tests are run in
parallel.

The fix involves using a sleep instead of waiting for a specific number
of jobs. It is less "clean" but it works for both sequential and
parallel test mode.


refactor: job_queue::JobHandle improvements

Changes:

  • auto cancel job when JobHandle is dropped.
    fixes test case that continues composing job after test completes
  • impl Drop for JobHandle
  • JobHandle now impls Future, so it can be awaited directly. cancellation in
    prove_concensus_program() should cause the JobHandle to be dropped. not
    tested yet
  • refine JobHandle api. remove un-needed methods
  • job-cancellation works cleanly in prove_concensus_program() select!{} via
    tokio::pin
  • add tests cancel_async_job_in_select, cancel_sync_job_in_select and comment
    test worker
  • improve job-queue module doc

refactor: split queue.rs and move triton_vm

These are some changes to prepare for moving job_queue into its own
crate, once we have a neptune-core workspace.

src/job_queue/triton_vm is moved to src/triton_vm_job_queue. It has
neptune-core specific logic, and thus belongs in neptune-core crate.

Changes:

  • break job-queue into separate files for better organization
  • move src/job_queue/triton_vm to src/triton_vm_job_queue.
  • remove the LogWhenDropped newtype
  • refactor queue.rs and comment thoroughly.
  • add doc-comment with examples for JobHandle::cancel()

dan-da added 9 commits May 7, 2025 16:21
closes Neptune-Crypto#579

Avoids calling expect() on a function that can fail when a job is
cancelled.

Changes:
 + prove_concensus_program() returns JobHandleError::JobCancelled
   instead of anyhow::bail!("job cancelled by caller")
 + create_block_transaction_from() returns an error when
   Transaction::merge_with() fails instead of calling expect()
 + check for JobCancelled error if composer_task fails.
adds two tests that cancel a job while composing

Changes:
 + mine() fn accepts a flag to indicate if it should perform initial 60
   second sleep or not.  This enables it to be called in unit tests.
 + add mine_loop test job_cancel_msg_cancels_composing()
 + add mine_loop test msg_from_main_does_not_crash_composer()
This enables proving job-cancellation to work in test environment and
cleans up the proving result error path.

Changes:
 + ProverJob handles job cancellation in test environment
 + prove() and prove_concensus_program() now return a CreateProofError
 + remove derive(Serialize, Deserialize) from several error types
   because not all errors can be serialized which creates a tension
   between serialization and having a complete error source chain.
 + change RpcError to use String as error source for most variants.
   RpcError is really the only error-type that needs to be serialized
   and callers do not need the error source chain.
 + fix CreateProofError downcast when checking composer_task error result
 + add tests::shared::wait_until() fn for tests to wait on a condition
 + remove CreateProofError::Failed catch-all for anyhow and replace with
   specific error variants
 + add JobQueue::num_jobs() and num_queued_jobs() methods
 + simplify composer_task error handling in mine_loop
The two new mine-loop tests for cancellation were depending on starting
with 0 jobs in the job-queue and no outside job-queue additions during
the test.  This is no longer a valid assumption when tests are run in
parallel.

The fix involves using a sleep instead of waiting for a specific number
of jobs.  It is less "clean" but it works for both sequential and
parallel test mode.
Changes:
+ auto cancel job when JobHandle is dropped.
  fixes test case that continues composing job after test completes
+ impl Drop for JobHandle
+ JobHandle now impls Future, so it can be awaited directly.  cancellation in
  prove_concensus_program() should cause the JobHandle to be dropped. not
  tested yet
+ refine JobHandle api. remove un-needed methods
+ job-cancellation works cleanly in prove_concensus_program() select!{} via
  tokio::pin
+ add tests cancel_async_job_in_select, cancel_sync_job_in_select and comment
  test worker
+ improve job-queue module doc
These are some changes to prepare for moving job_queue into its own
crate, once we have a neptune-core workspace.

src/job_queue/triton_vm is moved to src/triton_vm_job_queue.  It has
neptune-core specific logic, and thus belongs in neptune-core crate.

Changes:
+ break job-queue into separate files for better organization
+ move src/job_queue/triton_vm to src/triton_vm_job_queue.
+ remove the LogWhenDropped newtype
+ refactor queue.rs and comment thoroughly.
+ add doc-comment with examples for JobHandle::cancel()
adds a test for stopping the job-queue and hopefully will make
coveralls happier.
The stop_queue() test revealed a logic/flow problem when
the process_job task receives a Stop message.

The tokio::select!{} was in the wrong place. It was selecting between
the stop message and the job_added message.  Instead it needs to select
between the running job and the stop message.

This commit makes that change and now the stop_queue() test passes.
Copy link
Member

@jan-ferdinand jan-ferdinand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand every implication of all the changes, but this looks like an improvement (or more accurately, a bunch of improvements) to me. I've left some nits & minor suggestions inline.

@jan-ferdinand
Copy link
Member

It's weird that overall coverage decreased by a substantial amount. From what I can see, there are only few newly uncovered lines overall, and a decent number of those weren't touched in this PR. As far as I can tell, the tools we use for coverage reporting have not changed either.

@dan-da
Copy link
Collaborator Author

dan-da commented May 8, 2025

It's weird that overall coverage decreased by a substantial amount. From what I can see, there are only few newly uncovered lines overall, and a decent number of those weren't touched in this PR. As far as I can tell, the tools we use for coverage reporting have not changed either.

yes I considered filing an issue, or anyway was planning to mention it to you.

I added two new tests to mine_loop.rs that increased coverage there by 15% or so. And also new tests in queue.rs that increased coverage.

Many of the newly "uncovered" lines don't really make sense to me, such as tracing::debug() calls that are surrounded by otherwise covered code.

dan-da added 2 commits May 8, 2025 16:05
addresses review comments for Neptune-Crypto#583.

Changes:
+ derive strum::Display for JobCompletion
+ improve JobCompletion::Panicked variant comment
+ reflow JobHandle::cancel doc-comment example
+ emit tracing in JobHandle::cancel()
+ improve JobHandle::job_id() doc-comment
+ calc self.cancel() In JobHandle drop impl instead of cancel_tx.
+ improve doc-comments in JobId
+ improve module comment in job_queue/mod.rs
+ use doc-comments instead of "//" for private items in queue.rs
+ remove unnecessary let statement when spawining process_job task.
+ move stop-signal processing into its own fn so process_jobs select
  is more readable.
+ change order of assert_eq params
Addresses review comments for Neptune-Crypto#583:

+ use #[cfg(not(test))] instead of bool param to disable initial sleep
  in mining loop for unit tests
+ improve comment in mining loop when checking composer_task error
+ rename RpcError::Error to RpcError::WalletError
+ remove redundant copy_dir_recursive() in src/tests/shared.rs
@dan-da
Copy link
Collaborator Author

dan-da commented May 9, 2025

addressed all review comments. thx @jan-ferdinand. CI is green except for coveralls which appears to be acting a bit funny. merging.

@dan-da dan-da merged commit 13d8570 into Neptune-Crypto:master May 9, 2025
12 of 13 checks passed
dan-da added a commit that referenced this pull request May 9, 2025
addresses review comments for #583.

Changes:
+ derive strum::Display for JobCompletion
+ improve JobCompletion::Panicked variant comment
+ reflow JobHandle::cancel doc-comment example
+ emit tracing in JobHandle::cancel()
+ improve JobHandle::job_id() doc-comment
+ calc self.cancel() In JobHandle drop impl instead of cancel_tx.
+ improve doc-comments in JobId
+ improve module comment in job_queue/mod.rs
+ use doc-comments instead of "//" for private items in queue.rs
+ remove unnecessary let statement when spawining process_job task.
+ move stop-signal processing into its own fn so process_jobs select
  is more readable.
+ change order of assert_eq params
This was referenced May 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"job cancelled by caller" error while composing causes panic or shutdown.
2 participants