Description
@mxinden - Your input here might also be helpful.
After a bit of digging, we've found that the gossipsub heartbeat can be configured on the order of 1s, yet the actual time between heartbeats can grow to the order of many minutes. This heartbeat is crucial for gossipsub to maintain its peers and meshes, as well as to bound its memory cache and gossip messages on time.
A many-minute heartbeat can cause terrible conditions on the network and significantly grow memory use of gossipsub users.
We've tracked the issue down to the custom implementation of Interval in https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs.
As an example, we have three nodes running. In one of them (the green one) I've replaced the custom Interval with a tokio::time::Interval. The heartbeat is supposed to be 700ms.
Halfway through these graphs I switched the green node to tokio::time::Interval. Its heartbeat interval (time between heartbeats) then stays steady at 700ms, whereas the intervals of the other nodes have grown and continue to do so.
It's also useful to note that the heartbeat duration (time taken to do a heartbeat) itself is about 3-4x faster with tokio::time::Interval than with the custom implementation we are currently using.
After a very quick skim, I think the growing interval comes from this calculation: https://github.com/libp2p/rust-libp2p/blob/master/protocols/gossipsub/src/interval.rs#L86
If we miss an interval there, fires_at and delay can potentially become out of sync.
It is useful to note that the heartbeat interval grows in multiples of the desired interval.
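To illustrate the kind of fix I have in mind, here is a minimal sketch of how a missed tick could be handled without letting the stored deadline and the timer drift apart. This is not the code in interval.rs: next_deadline is a hypothetical helper, and times are modelled as Duration offsets from an arbitrary start so the arithmetic is easy to check. The idea is to skip every missed tick in one step and land on the next point of the original schedule.

```rust
use std::time::Duration;

/// Hypothetical helper, not part of rust-libp2p.
/// `fires_at` is the deadline that was scheduled, `period` is the desired
/// heartbeat interval, and `now` is the current time. All three are offsets
/// from the same arbitrary start. Returns the next deadline strictly after
/// `now` that still lies on the original schedule.
fn next_deadline(fires_at: Duration, period: Duration, now: Duration) -> Duration {
    assert!(!period.is_zero(), "period must be non-zero");
    if now < fires_at {
        // The scheduled deadline has not passed yet; nothing to adjust.
        return fires_at;
    }
    // Whole periods that have elapsed since the scheduled deadline.
    let missed = (now - fires_at).as_nanos() / period.as_nanos();
    let missed = u32::try_from(missed).expect("too many missed ticks");
    // Skip all missed ticks at once and land on the next schedule point.
    fires_at + period * (missed + 1)
}

fn main() {
    let period = Duration::from_millis(700);
    // Scheduled for 700ms but only polled at 1500ms: one tick (1400ms) was missed.
    let next = next_deadline(period, period, Duration::from_millis(1500));
    println!("next heartbeat at {:?}", next);
}
```

With something like this, the caller would reset the underlying timer to exactly the returned value and store the same value as the new fires_at, so the two cannot diverge no matter how late the poll happens.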
It would be nice to get the current implementation to work correctly and be as performant as the tokio version. Failing that, we could add a feature flag or something similar so that tokio users can benefit from the performance gains.