Don't Always Broadcast latest state on startup if we'd previously closed-without-broadcast #1563

Open
TheBlueMatt opened this issue Jun 23, 2022 · 8 comments · Fixed by #1564
Labels: good first issue (Good for newcomers), Take a Friday Leave a Friday (Stomp the Bugs, Without Much Commitment)
Milestone: 0.2

Comments

@TheBlueMatt
Collaborator

On startup the ChannelManager will broadcast the latest state transaction(s) for any ChannelMonitor it has for which it has no Channel. This is all fine and great, except if we'd previously closed the ChannelMonitor with a ChannelForceClosed { should_broadcast: false } indicating it may be unsafe to broadcast. In this case, we should disable automated broadcasting. This should just require tracking the should_broadcast state in the ChannelMonitor itself and checking it before broadcasting (maybe a new automated flag on broadcast_latest_holder_commitment_txn?).
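
To make this concrete, here is a minimal, standalone sketch of the startup pass. Every name here (MonitorSketch, automated_broadcast_allowed, startup_broadcast_pass) is hypothetical and the types are heavily simplified; this is not LDK's actual ChannelManager/ChannelMonitor code:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a monitor's persisted state.
struct MonitorSketch {
    funding_txid: String,
    /// Proposed flag: cleared if the channel was closed with
    /// `should_broadcast: false`, meaning broadcasting may be unsafe.
    automated_broadcast_allowed: bool,
}

/// Sketch of the startup pass: for every monitor whose funding transaction has
/// no corresponding live Channel, broadcast the latest holder state -- unless
/// the monitor recorded that automated broadcasting was disabled.
fn startup_broadcast_pass(monitors: &[MonitorSketch], live_channel_funding: &HashSet<String>) {
    for monitor in monitors {
        if live_channel_funding.contains(&monitor.funding_txid) {
            continue; // the ChannelManager still has this channel; nothing to do
        }
        if monitor.automated_broadcast_allowed {
            println!("broadcasting latest holder commitment for {}", monitor.funding_txid);
        } else {
            println!("skipping {}: closed without broadcast, may be unsafe", monitor.funding_txid);
        }
    }
}

fn main() {
    let monitors = vec![
        MonitorSketch { funding_txid: "aaa".into(), automated_broadcast_allowed: true },
        MonitorSketch { funding_txid: "bbb".into(), automated_broadcast_allowed: false },
    ];
    // Suppose no channels survived deserialization on this restart.
    let live: HashSet<String> = HashSet::new();
    startup_broadcast_pass(&monitors, &live);
}
```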

This is really "the" issue on #775 so I'm gonna close that in favor of this one.

@TheBlueMatt TheBlueMatt added the good first issue label Jun 23, 2022
@TheBlueMatt TheBlueMatt added this to the 0.1 milestone Jun 23, 2022
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 23, 2022
When we receive a `channel_reestablish` with a `data_loss_protect`
that proves we're running with a stale state, instead of
force-closing the channel, we immediately panic. This lines up with
our refusal to run if we find a `ChannelMonitor` which is stale
compared to our `ChannelManager` during `ChannelManager`
deserialization. Ultimately both are an indication of the same
thing - that the API requirements on `chain::Watch` were violated.

In the "running with outdated state but ChannelMonitor(s) and
ChannelManager lined up" case specifically its likely we're running
off of an old backup, in which case connecting to peers with
channels still live is explicitly dangerous. That said, because
this could be an operator error that is correctable, panicing
instead of force-closing may allow for normal operation again in
the future (cc lightningdevkit#1207).

In any case, we provide instructions in the panic message for how
to force-close channels prior to peer connection, as well as a note
on how to broadcast the latest state if users are willing to take
the risk.

Note that this is still somewhat unsafe until we resolve lightningdevkit#1563.
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 24, 2022
TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 24, 2022
@QPotato

QPotato commented Jun 25, 2022

I want to take this as my first issue.

When you say "On startup the ChannelManager will broadcast the latest state transaction(s) for any ChannelMonitor it has for which it has no Channel", I assume you mean channelmanager.rs:6823-6828, where it checks, for each channel monitor, whether its funding transaction is not in the funding transactions set, right? Then, if I understand this correctly, when de-serializing the persisted state, every ChannelMonitorUpdateStep for a ChannelMonitor is replayed so that we build up to the state we had before the last shutdown. The solution you propose would then be to (roughly as sketched after this list):

  • Add a new property allow_automated_broadcast to ChannelMonitorImpl that defaults to true
  • On update_monitor, when matching ChannelForceClosed with should_broadcast: false, set the flag to false
  • On broadcast_latest_holder_commitment_txn, check the flag and, if it's false... log_info! and do nothing?
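
To make that plan concrete, a minimal, self-contained sketch of the update-replay handling. All of the names below (UpdateStepSketch, MonitorFlags, apply) are hypothetical stand-ins, not LDK's actual ChannelMonitorUpdateStep or ChannelMonitorImpl:

```rust
/// Hypothetical, simplified stand-in for the relevant update-step variants.
enum UpdateStepSketch {
    ChannelForceClosed { should_broadcast: bool },
    LatestHolderCommitment, // stand-in for the other, elided variants
}

/// Hypothetical stand-in for the proposed flag on the monitor.
struct MonitorFlags {
    allow_automated_broadcast: bool, // defaults to true when the monitor is created
}

impl MonitorFlags {
    /// Called while applying (or replaying, on deserialization) each persisted update step.
    fn apply(&mut self, step: &UpdateStepSketch) {
        match step {
            UpdateStepSketch::ChannelForceClosed { should_broadcast: false } => {
                // Broadcasting was flagged unsafe at close time; remember that.
                self.allow_automated_broadcast = false;
            }
            // No other step re-enables automated broadcasting.
            _ => {}
        }
    }
}

fn main() {
    let mut flags = MonitorFlags { allow_automated_broadcast: true };
    let steps = [
        UpdateStepSketch::LatestHolderCommitment,
        UpdateStepSketch::ChannelForceClosed { should_broadcast: false },
    ];
    for step in &steps {
        flags.apply(step);
    }
    // After replay, the automated startup broadcast would be skipped.
    assert!(!flags.allow_automated_broadcast);
}
```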

Some questions I have:

  • It feels odd to have a method called broadcast... that won't always broadcast. Should we either:

    • Rename the method to maybe_broadcast_latest_holder_commitment_txn, so that it's explicit that under certain conditions it will not broadcast?
    • Define a new method maybe_broadcast... that wraps up the current one but checks the flag first and use this new method in this specific use case?

    I'm also asking because the method is also called when the de-serialized ChannelManager is stale compared to the ChannelMonitor, though in this case there is a channel in the funding transactions set (lines 6789-6804). I assume in that case too we want the flag to stop us, but I want to make sure I'm not changing some important semantics.

  • Is there any other ChannelMonitorUpdateStep that would imply that it's safe again to broadcast? Looking at the docs for the Enum I would guess no, but asking just in case.

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this issue Jun 25, 2022
@TheBlueMatt TheBlueMatt reopened this Jun 27, 2022
@TheBlueMatt
Collaborator Author

Heh, 1564 didn't resolve this, it just referenced it...

@TheBlueMatt
Collaborator Author

Yes, your understanding is correct. As for your questions:

It feels odd to have a method called broadcast... that won't always broadcast. Should we either:

Probably the second; it's still important that users be able to broadcast the latest state if they want to, so we at least need an option to broadcast the state. Though it also wouldn't be unreasonable to split it into two methods - maybe_ and force_..._unsafe - to indicate that the first will only work if we think it's okay to do so, and the second will force it to happen even if it's unsafe to do so.
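
As a rough illustration of that split (hypothetical names and heavily simplified types; the real methods would live on ChannelMonitor and take a broadcaster, fee estimator, and logger, all elided here):

```rust
/// Hypothetical, simplified monitor used only to illustrate the two entry points.
struct MonitorApiSketch {
    broadcast_considered_safe: bool,
    latest_holder_commitment_txn: Vec<String>, // stand-in for real transactions
}

impl MonitorApiSketch {
    /// Automated callers (e.g. the startup path) use this: it does nothing if we
    /// previously closed with `should_broadcast: false`.
    fn maybe_broadcast_latest_holder_commitment_txn(&self) {
        if self.broadcast_considered_safe {
            self.broadcast();
        } else {
            println!("not broadcasting automatically: our state may be unsafe to broadcast");
        }
    }

    /// Explicit user escape hatch: broadcasts even though it may be unsafe,
    /// e.g. funds could be lost if our state turns out to be stale.
    fn force_broadcast_latest_holder_commitment_txn_unsafe(&self) {
        self.broadcast();
    }

    fn broadcast(&self) {
        for tx in &self.latest_holder_commitment_txn {
            println!("broadcasting {}", tx);
        }
    }
}

fn main() {
    let mon = MonitorApiSketch {
        broadcast_considered_safe: false,
        latest_holder_commitment_txn: vec!["holder_commitment_tx".to_string()],
    };
    mon.maybe_broadcast_latest_holder_commitment_txn(); // skipped
    mon.force_broadcast_latest_holder_commitment_txn_unsafe(); // user explicitly opted in
}
```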

I'm also asking because the method is also called when the de-serialized ChannelManager is stale compared to the ChannelMonitor, though in this case there is a channel in the funding transactions set (lines 6789-6804). I assume in that case too we want the flag to stop us, but I want to make sure I'm not changing some important semantics.

I think in that case, yes, the flag should stop us (though of course generally the flag is not set).

Is there any other ChannelMonitorUpdateStep that would imply that it's safe again to broadcast? Looking at the docs for the Enum I would guess no, but asking just in case.

No, once it's unsafe it's unsafe. Unsafe here means we've lost some data, and there's no way to recover that data.

@QPotato

QPotato commented Jun 27, 2022

Great. I'll get to work on the PR then. Thanks!

@QPotato

QPotato commented Jul 4, 2022

I'm having a hard time testing the code. I'll keep working on it, but I'm posting my progress here in case anyone wants to chip in with help/comments/ideas. My idea would be to cover this new behavior with a simple unit test that:

  • Builds a ChannelMonitor
  • Sends it a ChannelForceClosed { should_broadcast: false }
  • Writes and reads it again
  • And then calls the new methods and asserts whether the transaction was broadcast.

My plan was to copy and adapt some other unit test for broadcast_latest_holder_commitment_txn, but I couldn't find any. If I understand correctly, you write unit tests at the end of the module and integration tests in separate modules, mostly in functional_tests.rs. I also found:

  • A TestPersister that I think implements Write and Read in-memory.
  • A TestBroadcaster that can probably help me with the assertions.
  • A check_closed_broadcast! which I believe won't be useful here, because it's checking whether we broadcast the channel-closed message as a gossip update for network peers, though I'm not really sure I understood that correctly.
  • On your last PR, Matt, you added do_test_data_loss_protect, which seems to be a more general-case integration test. Maybe I should just (or also) extend that one?

@TheBlueMatt
Collaborator Author

I left a small pointer on your PR and let's continue there, but in general it's probably simplest, and gives the most coverage, to add a new functional test based on an existing one.

@TheBlueMatt TheBlueMatt added the Take a Friday Leave a Friday label Jun 13, 2024
@TheBlueMatt TheBlueMatt modified the milestones: 0.1, 0.2 Dec 5, 2024
@philipp1992

2025-01-30T10:33:55.371Z hydra_bitcoin::node::logger:22: A ChannelManager is stale compared to the current ChannelMonitor!
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: The channel will be force-closed and the latest commitment transaction from the ChannelMonitor broadcast.
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: The ChannelMonitor for channel b37b4660dec0340c000ab33a9c877d671a7e4fe3f7f9945cd9704668e760430f is at update_id 27 but the ChannelManager is at update_id 26.
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: The ChannelMonitor for channel b37b4660dec0340c000ab33a9c877d671a7e4fe3f7f9945cd9704668e760430f is at holder commitment number 281474976710644 but the ChannelManager is at holder commitment number 281474976710645.
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: A ChannelManager is stale compared to the current ChannelMonitor!
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: The channel will be force-closed and the latest commitment transaction from the ChannelMonitor broadcast.
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: The ChannelMonitor for channel d91933328927a0fcd20613150cbb852bf8d9d5873b716ab44ce50d504975b8a4 is at update_id 15 but the ChannelManager is at update_id 13.
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: The ChannelMonitor for channel d91933328927a0fcd20613150cbb852bf8d9d5873b716ab44ce50d504975b8a4 is at holder commitment number 281474976710649 but the ChannelManager is at holder commitment number 281474976710650.
ERROR 2025-01-30T10:33:55.372Z hydra_bitcoin::node::logger:22: The ChannelMonitor for channel d91933328927a0fcd20613150cbb852bf8d9d5873b716ab44ce50d504975b8a4 is at revoked counterparty transaction number 281474976710650 but the ChannelManager is at revoked counterparty transaction number 281474976710651.

Is this related?

@wpaulino
Contributor

wpaulino commented Jan 30, 2025

@philipp1992 not quite. It seems an update was made to the channel that was persisted at the ChannelMonitor, but not at the ChannelManager. Are you using async monitor persistence? Did the node crash or not shut down cleanly? Sharing some more logs from before the restart may be helpful.
