
Conversation

Contributor

@pbeza pbeza commented Dec 17, 2025

Fixes #1647

@pbeza pbeza linked an issue Dec 17, 2025 that may be closed by this pull request
Contributor Author

pbeza commented Dec 19, 2025

I’m not particularly proud of this PR, mainly because it builds on top of an already messy P2P implementation and adds more complexity. To do ping/pong properly, this layer would really need a complete redesign (and if it had been designed well in the first place, the ping-pong mechanism would probably be redundant anyway).

As for the comments I added: they helped me a lot in understanding the networking layer, but I can imagine they might feel excessive to someone already familiar with the code. If you think they should be removed, I’m totally fine with that.

That said, feel free to review.

@pbeza pbeza requested review from DSharifi and gilcu3 December 19, 2025 18:02
@pbeza pbeza marked this pull request as ready for review December 19, 2025 18:02
Copilot AI review requested due to automatic review settings December 19, 2025 18:02
Contributor

Copilot AI left a comment


Pull request overview

This PR implements a ping/pong heartbeat mechanism to actively monitor P2P connection health, addressing issue #1647. The implementation adds bidirectional health checks where each node periodically sends ping packets to its peers and expects pong responses, automatically closing connections that fail to respond within a timeout period.

Key changes:

  • Added sequence-numbered Ping/Pong packets for connection health monitoring
  • Implemented a watchdog task that monitors pong timeouts and closes dead connections
  • Added PongState shared between incoming/outgoing streams to track connection health with unidirectional I/O
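
To make the mechanism concrete, here is a minimal, self-contained sketch of a sequence-numbered heartbeat of this kind. The `Packet` variants, channel types, and constants below are illustrative assumptions rather than the PR's actual code (the 1-second ping interval and 20-second pong timeout follow values mentioned later in this thread):

```rust
use std::time::Duration;
use tokio::sync::{mpsc, watch};

/// Illustrative wire packets; the real PR defines its own types.
enum Packet {
    Ping(u64),
    Pong(u64),
    // ... application messages elided
}

const PING_INTERVAL: Duration = Duration::from_secs(1);
const PONG_TIMEOUT: Duration = Duration::from_secs(20);

/// Periodically send sequence-numbered pings and give up on the connection
/// if the matching pong is not observed within PONG_TIMEOUT.
async fn keepalive(
    sender: mpsc::UnboundedSender<Packet>,
    mut pong_rx: watch::Receiver<u64>,
) {
    let mut seq: u64 = 0;
    loop {
        seq += 1;
        if sender.send(Packet::Ping(seq)).is_err() {
            return; // writer task is gone, nothing left to monitor
        }
        // Wait until a pong for this ping (or a later one) shows up.
        let got_pong = tokio::time::timeout(PONG_TIMEOUT, async {
            loop {
                if *pong_rx.borrow_and_update() >= seq {
                    return true;
                }
                if pong_rx.changed().await.is_err() {
                    return false; // pong-forwarding side was dropped
                }
            }
        })
        .await
        .unwrap_or(false);
        if !got_pong {
            // Timed out: a real implementation would close the connection
            // here so that it gets re-established.
            return;
        }
        tokio::time::sleep(PING_INTERVAL).await;
    }
}
```

In the PR's split design the ping goes out on this node's outgoing stream, while the matching pong arrives on the incoming stream and is forwarded to the keepalive task via a watch channel, which keeps each TLS stream unidirectional.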


Contributor Author

pbeza commented Dec 22, 2025

@gilcu3 any idea whether this test is flaky, or if there’s something wrong with my implementation?

ERROR tests/robust_ecdsa/test_parallel_sign_calls.py::test_parallel_sign_calls[3] - AssertionError: Nodes did not reach expected MPC presignature counts (available) before timeout.

BTW, I’ve refined the implementation a bit since I wrote this, so feel free to review it now (cc @gilcu3 @DSharifi).

Contributor

gilcu3 commented Dec 22, 2025

@gilcu3 any idea whether this test is flaky, or if there’s something wrong with my implementation?

ERROR tests/robust_ecdsa/test_parallel_sign_calls.py::test_parallel_sign_calls[3] - AssertionError: Nodes did not reach expected MPC presignature counts (available) before timeout.

That test has something weird, as explained in #1691. But indeed here I am seeing:

...
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
Available presignatures: [4, 0, 0, 5, 5, 3]
E

which means that after some time 2 nodes still did not have any presignature, so there must be something not working as expected.

Contributor Author

pbeza commented Dec 22, 2025

@gilcu3 any idea whether this test is flaky, or if there’s something wrong with my implementation?

ERROR tests/robust_ecdsa/test_parallel_sign_calls.py::test_parallel_sign_calls[3] - AssertionError: Nodes did not reach expected MPC presignature counts (available) before timeout.

That test has something weird, as explained in #1691. But indeed here I am seeing:

(...)

@gilcu3 I ran this test locally and it passes for me, so it seems CI-specific unless I’m missing something:

((venv) ) ➜  pytest git:(1647-implement-echo-connection-check) ✗ pytest --non-reproducible  tests/robust_ecdsa/test_parallel_sign_calls.py::test_parallel_sign_calls     
================================================================================ test session starts ================================================================================
platform darwin -- Python 3.12.12, pytest-8.3.4, pluggy-1.6.0
rootdir: /Users/patryk/mpc/pytest
configfile: pytest.ini
plugins: libtmux-0.53.0, locust-2.42.6, rerunfailures-16.1
collected 1 item                                                                                                                                                                    

tests/robust_ecdsa/test_parallel_sign_calls.py 

.                                                                                                                              [100%]

=========================================================================== 1 passed in 355.51s (0:05:55) ===========================================================================

((venv) ) ➜  pytest git:(1647-implement-echo-connection-check) ✗ pytest --non-reproducible "tests/robust_ecdsa/test_parallel_sign_calls.py::test_parallel_sign_calls[3]"
================================================================= test session starts =================================================================
platform darwin -- Python 3.12.12, pytest-8.3.4, pluggy-1.6.0
rootdir: /Users/patryk/mpc/pytest
configfile: pytest.ini
plugins: libtmux-0.53.0, locust-2.42.6, rerunfailures-16.1
collected 1 item                                                                                                                                      

tests/robust_ecdsa/test_parallel_sign_calls.py .                                                                                                [100%]

============================================================ 1 passed in 79.64s (0:01:19) ============================================================

((venv) ) ➜  pytest git:(1647-implement-echo-connection-check) ✗ git rev-parse HEAD
e85ca872feeed5eef72623ba02adcce57272fa4f

Contributor

gilcu3 commented Dec 22, 2025

@gilcu3 any idea whether this test is flaky, or if there’s something wrong with my implementation?

ERROR tests/robust_ecdsa/test_parallel_sign_calls.py::test_parallel_sign_calls[3] - AssertionError: Nodes did not reach expected MPC presignature counts (available) before timeout.

That test has something weird, as explained in #1691. But indeed here I am seeing:
(...)

@gilcu3 I ran this test locally and it passes for me, so it seems CI-specific unless I’m missing something:

(...)

Interesting, so all tests are passing locally for you now? Did you rebuild the mpc-node (or rebase on main, which now happens automatically)? If yes to all of that, can you try disabling only that one test and see if CI passes?

Contributor Author

pbeza commented Dec 23, 2025

(...) can you try disabling that one only and see if CI passes?

Unfortunately, commenting out that one test didn’t help. There seems to be something going on during resharing when some nodes disconnect and then reconnect. I’ll get back to this early next year (🎄), as I wasn’t even able to quickly monkey-patch it.

@pbeza pbeza requested a review from kevindeforth January 5, 2026 12:40
Contributor Author

pbeza commented Jan 5, 2026

Ready for review now — I fixed the failing test (context):

pytest --non-reproducible "tests/robust_ecdsa/test_parallel_sign_calls.py::test_parallel_sign_calls[3]"

The fix itself was straightforward: just bumping PONG_TIMEOUT (430dc76).

Root cause: the initial ping/pong implementation could cause connection churn during resharing due to a timing issue. When Node A receives a Ping from Node B on its incoming connection, it needs to send a Pong back via its outgoing connection to Node B. During resharing or initial connection setup, that outgoing connection may not be ready yet, so pongs end up being buffered.

By the time the outgoing connection is established, Node B has already hit the 5-second PONG_TIMEOUT, closes the connection, and retries. This creates a reconnection loop that prevents presignature generation. That’s why I bumped PONG_TIMEOUT to 20s, which makes the test pass.
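As a rough illustration of that failure mode (hypothetical names, not the actual diff), the Pong reply depends on the outgoing connection being up:

```rust
use tokio::sync::mpsc;

/// Wire packets, trimmed to the heartbeat variant needed here.
enum Packet {
    Pong(u64),
}

/// Sketch of the situation described above: a Ping arrived on the incoming
/// stream, but the Pong must leave via our *outgoing* connection, which may
/// still be (re)connecting, e.g. during resharing.
fn reply_to_ping(
    ping_seq: u64,
    outgoing: Option<&mpsc::UnboundedSender<Packet>>,
    pong_buffer: &mpsc::UnboundedSender<u64>,
) {
    match outgoing {
        Some(sender) => {
            let _ = sender.send(Packet::Pong(ping_seq));
        }
        None => {
            // Buffered until the outgoing connection comes up. If that takes
            // longer than the peer's PONG_TIMEOUT, the peer closes and
            // retries, producing the reconnection loop described above;
            // hence the bump from 5s to 20s.
            let _ = pong_buffer.send(ping_seq);
        }
    }
}
```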

The test that used to fail now takes ~1m17s on this branch. On main, it’s roughly the same, so there doesn’t seem to be any regression.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 8 comments.



tracing::info!(
"Could not connect to {}, retrying: {}, me {}",
tracing::warn!(
"Could not connect to {}: {:#}, retrying (me: {})",

Copilot AI Jan 6, 2026


The error message format on line 449 uses {:#} for the error, which selects the alternate display format (for anyhow errors, the full context chain), but mixing it with the parenthetical "(me: {})" creates inconsistent formatting. Consider using a consistent format like "Could not connect to {}: {} (me: {})" with regular formatting, or explain why the alternate formatting is needed here.

Suggested change
"Could not connect to {}: {:#}, retrying (me: {})",
"Could not connect to {}: {}, retrying (me: {})",

&format!("Ping sender for {}", target_participant_id),
async move {
let mut seq: u64 = 0;
let mut last_received_pong_seq: u64 = 0;

Copilot AI Jan 6, 2026


The last_received_pong_seq is initialized to 0 on line 305, but the watch channel is also initialized with seq: 0 on line 263. This means if a spurious changed() event fires before the first ping is sent, the keepalive task will read seq: 0 and compare it to last_received_pong_seq: 0, treating it as a stale pong. While this is likely harmless, consider initializing last_received_pong_seq to a sentinel value like u64::MAX to make the initial state more explicit.
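
If that sentinel were adopted, the staleness check might look roughly like this (hypothetical names, not the PR's code):

```rust
/// Sentinel meaning "no pong observed yet"; real sequence numbers start at 1.
const NO_PONG_YET: u64 = u64::MAX;

/// With the sentinel, the watch channel's initial value of 0 can no longer be
/// confused with an already-seen pong.
fn is_stale_pong(pong_seq: u64, last_received_pong_seq: u64) -> bool {
    last_received_pong_seq != NO_PONG_YET && pong_seq <= last_received_pong_seq
}
```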

conn.connectivity.any_outgoing_connection()
{
// Send the new pong info via watch channel
let _ = outgoing_conn.pong_tx.send(PongInfo { seq });

Copilot AI Jan 6, 2026


The error from pong_tx.send() on line 623 is silently ignored with let _ =. If the watch channel receiver is closed (which happens when the keepalive task exits), this could indicate the connection is in an inconsistent state. Consider logging a warning when this send fails to help diagnose connection lifecycle issues.

Suggested change
let _ = outgoing_conn.pong_tx.send(PongInfo { seq });
if let Err(err) = outgoing_conn.pong_tx.send(PongInfo { seq }) {
tracing::warn!(
"Failed to forward Pong({}) to keepalive for {}: {}",
seq,
peer_id,
err
);
}

Comment on lines +439 to +443
// Register connection and drain any buffered Pongs
connectivity.set_outgoing_connection(&new_conn);
while let Ok(seq) = pong_buffer_rx.try_recv() {
let _ = new_conn.sender.send(Packet::Pong(seq));
}

Copilot AI Jan 6, 2026


There's a race condition in the pong buffer draining logic. After registering the connection on line 440, pongs could arrive and be sent via pong_tx.send() (line 623) while the buffer is being drained on lines 441-443. This could result in pongs being sent in the wrong order: buffered pong N+1 sent after live pong N. Consider draining the buffer before registering the connection, or using a mutex to ensure atomic drain-and-register.

Suggested change
// Register connection and drain any buffered Pongs
connectivity.set_outgoing_connection(&new_conn);
while let Ok(seq) = pong_buffer_rx.try_recv() {
let _ = new_conn.sender.send(Packet::Pong(seq));
}
// Drain any buffered Pongs, then register connection
while let Ok(seq) = pong_buffer_rx.try_recv() {
let _ = new_conn.sender.send(Packet::Pong(seq));
}
connectivity.set_outgoing_connection(&new_conn);

Comment on lines +594 to +598
tracing::info!(
"Outgoing connection to {} is dead, closing incoming connection for clean reconnect",
peer_id
);
break;

Copilot AI Jan 6, 2026


The incoming handler breaks the connection loop on line 598 when the outgoing sender channel is closed, but this creates an asymmetric failure mode. If the outgoing connection fails, it will automatically reconnect via PersistentConnection, but breaking the incoming connection here forces the peer to also reconnect their outgoing connection. Consider logging this as info rather than breaking, or add a comment explaining why both connections must be reset together.

Suggested change
tracing::info!(
"Outgoing connection to {} is dead, closing incoming connection for clean reconnect",
peer_id
);
break;
// The outgoing connection (and its sender) may be temporarily down.
// Do not tear down the incoming connection here: the outgoing side
// will be re-established via PersistentConnection, and keeping the
// incoming connection alive avoids forcing the peer to reconnect
// its own outgoing connection.
tracing::info!(
"Outgoing connection to {} is dead; keeping incoming connection open and relying on reconnect",
peer_id
);

) -> anyhow::Result<PersistentConnection> {
let connectivity_clone = connectivity.clone();
let connectivity_clone = Arc::clone(&connectivity);
let (pong_buffer_tx, mut pong_buffer_rx) = mpsc::unbounded_channel::<u64>();

Copilot AI Jan 6, 2026


The pong buffer uses an unbounded channel which could accumulate pongs indefinitely if the outgoing connection never establishes. During extended network partitions, this could lead to unbounded memory growth. Consider using a bounded channel with a small capacity, or adding periodic cleanup of old buffered pongs to prevent memory leaks.

Comment on lines +223 to +224
/// Interval between consecutive pings. A new ping is sent 1 second after the previous
/// ping was sent, but only after receiving its pong response.

Copilot AI Jan 6, 2026


The comment states "A new ping is sent 1 second after the previous ping was sent, but only after receiving its pong response." This is misleading - the code on lines 341-344 shows that the wait is until 1 second has elapsed since the ping was sent, not 1 second after receiving the pong. If the pong arrives after 0.5 seconds, the next ping is sent 0.5 seconds later (total 1 second from previous ping). Consider clarifying the comment to say "Pings are sent at 1-second intervals from when the previous ping was sent, waiting for pong response before sending the next."

Suggested change
/// Interval between consecutive pings. A new ping is sent 1 second after the previous
/// ping was sent, but only after receiving its pong response.
/// Interval between consecutive pings. Pings are sent at 1-second intervals from when
/// the previous ping was sent, waiting for its pong response before sending the next.
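
The timing described by the corrected comment (one interval measured from when the previous ping was sent, not from when its pong arrived) can be expressed with `tokio::time::sleep_until`; a rough sketch:

```rust
use std::time::Duration;
use tokio::time::{sleep_until, Instant};

const PING_INTERVAL: Duration = Duration::from_secs(1);

/// Pace the next ping relative to when the previous one was sent, not
/// relative to when its pong arrived. If handling the pong already took
/// longer than PING_INTERVAL, this returns immediately.
async fn pace_next_ping(previous_ping_sent_at: Instant) {
    sleep_until(previous_ping_sent_at + PING_INTERVAL).await;
}
```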

let expected_seq = last_received_pong_seq + 1;
if pong_info.seq != expected_seq {
tracing::warn!(
"Received pong {} from {}, expected {}, lost {} pong(s)",

Copilot AI Jan 6, 2026


The warning message on lines 323-326 logs when pongs are received out of sequence, but the message format could be clearer. The phrase "lost N pong(s)" is ambiguous - it's not clear if the pongs were actually lost or just received in the wrong order. Consider rephrasing to "received pong {} from {}, expected {} (gap of {} pong(s))" to make it clear this is a sequence gap detection rather than a confirmed loss.

Suggested change
"Received pong {} from {}, expected {}, lost {} pong(s)",
"Received pong {} from {}, expected {} (gap of {} pong(s))",

Contributor

@DSharifi DSharifi left a comment


Thanks for the work put into this.

Let's discuss the implementation over a call when you are back tomorrow. I think we can simplify things by not buffering pong messages and only tracking the latest sequence number that has been sent with a ping.

The (un)structured logging is also a blocker.
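
As a rough sketch of the "track only the latest sequence number" simplification above (hypothetical `PongTracker` type, not a concrete proposal for this diff): the keepalive only remembers the last ping it sent, and the incoming stream just overwrites a shared latest-pong value, so nothing is ever buffered.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Shared between the incoming stream (which records pongs) and the keepalive
/// task (which checks them). Only the latest value matters, so nothing is
/// ever buffered.
#[derive(Default)]
struct PongTracker {
    latest_pong_seq: AtomicU64,
}

impl PongTracker {
    /// Called by the incoming stream when a Pong arrives; older values are
    /// simply overwritten.
    fn record_pong(&self, seq: u64) {
        self.latest_pong_seq.store(seq, Ordering::Relaxed);
    }

    /// Called by the keepalive/watchdog: healthy iff the peer has
    /// acknowledged the last ping we sent.
    fn is_healthy(&self, last_sent_ping_seq: u64) -> bool {
        self.latest_pong_seq.load(Ordering::Relaxed) >= last_sent_ping_seq
    }
}
```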

Comment on lines +121 to +122
/// Channel for buffering Pongs when outgoing connection is temporarily unavailable.
pong_buffer: UnboundedSender<u64>,
Contributor


Don't use an UnboundedSender. If we can't consume incoming pong messages fast enough and start buffering beyond a reasonable limit, we should just shed them.

Also, why do we need to buffer multiple pongs to the same peer? Isn't the last sequence number the only one we care about?
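
One way to get exactly that shedding behavior is a `tokio::sync::watch` channel, which by construction keeps only the most recent value; a minimal sketch:

```rust
use tokio::sync::watch;

fn main() {
    // 0 stands for "no pong yet"; real sequence numbers start at 1.
    let (pong_tx, mut pong_rx) = watch::channel(0u64);

    // Incoming-stream side: overwrite with the newest pong sequence number.
    // Anything the keepalive task has not read yet is superseded, not buffered.
    pong_tx.send(41).unwrap();
    pong_tx.send(42).unwrap();

    // Keepalive side: only ever observes the latest value.
    assert_eq!(*pong_rx.borrow_and_update(), 42);
}
```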

let expected_seq = last_received_pong_seq + 1;
if pong_info.seq != expected_seq {
tracing::warn!(
"Received pong {} from {}, expected {}, lost {} pong(s)",
Contributor


Please use structured logging for these values

https://docs.rs/tracing/latest/tracing/#recording-fields
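
For reference, a structured version of the warning above could look roughly like this (field names are illustrative):

```rust
use tracing::warn;

/// Illustrative only: the same warning with structured fields, so log
/// aggregators can filter on peer_id / sequence numbers directly.
fn warn_out_of_sequence_pong(peer_id: &str, received_seq: u64, expected_seq: u64) {
    warn!(
        peer_id,
        received_seq,
        expected_seq,
        gap = received_seq.saturating_sub(expected_seq),
        "received out-of-sequence pong"
    );
}
```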

Comment on lines +315 to +316
// Wait for either a pong response or timeout
tokio::select! {
Contributor


You don't need to use a select! here if you want a timeout.

Check out tokio::time::timeout instead.
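
A sketch of what that could look like for the pong wait (hypothetical names and types):

```rust
use std::time::Duration;
use tokio::sync::watch;
use tokio::time::timeout;

const PONG_TIMEOUT: Duration = Duration::from_secs(20);

/// Wait for the next pong notification, giving up after PONG_TIMEOUT.
/// Returns the observed sequence number, or None on timeout / closed channel.
async fn wait_for_pong(pong_rx: &mut watch::Receiver<u64>) -> Option<u64> {
    match timeout(PONG_TIMEOUT, pong_rx.changed()).await {
        Ok(Ok(())) => Some(*pong_rx.borrow_and_update()),
        Ok(Err(_)) => None, // sender side was dropped
        Err(_) => None,     // PONG_TIMEOUT elapsed
    }
}
```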

conn.connectivity.any_outgoing_connection()
{
// Send the new pong info via watch channel
let _ = outgoing_conn.pong_tx.send(PongInfo { seq });
Contributor


Why are we silently dropping the error?

);
new_conn

// Register connection and drain any buffered Pongs
Contributor


This comment is confusing. "drain" implies we'd discard the buffered pongs, but in the code below you are actually consuming the pongs to send them?

tracing::info!(
"Could not connect to {}, retrying: {}, me {}",
tracing::warn!(
"Could not connect to {}: {:#}, retrying (me: {})",
Contributor


Same comment as above regarding structured logging

{
if outgoing_conn.sender.send(Packet::Pong(seq)).is_err() {
tracing::info!(
"Outgoing connection to {} is dead, closing incoming connection for clean reconnect",
Contributor


Where is this clean reconnect triggered?

Comment on lines +43 to +106
/// This struct manages **outgoing connections only** - one persistent TLS connection to each
/// peer in the network. When the application wants to send a message to a peer, it queues the
/// message through the corresponding [`PersistentConnection`], which handles automatic
/// reconnection if the connection drops. Each connection runs two background tasks: one for
/// sending data, and one for sending 1-second interval ping heartbeats and monitoring pong responses.
///
/// Implements the [`MeshNetworkTransportSender`] trait to provide a high-level API for sending
/// messages (`.send()`, `.send_indexer_height()`) and checking connectivity status
/// (`.connectivity()`, `.wait_for_ready()`), while handling low-level connection management.
pub struct TlsMeshSender {
/// The participant ID of this node.
my_id: ParticipantId,
/// List of all participant IDs in the network (including this node).
participants: Vec<ParticipantId>,
/// Outgoing connections to all peers (excludes this node). Each connection automatically
/// retries on failure. This is where actual message sending happens - when you call
/// `.send()`, it looks up the connection here and queues the message.
connections: HashMap<ParticipantId, Arc<PersistentConnection>>,
/// Tracks connection state (incoming and outgoing) for all peers. This is separate from
/// `connections` because it monitors *both directions* - while `connections` only manages
/// our outgoing connections, `connectivities` tracks whether both our outgoing connection
/// to a peer AND their incoming connection to us are alive. Used by `.wait_for_ready()`
/// and `.connectivity()` to check bidirectional connectivity status.
connectivities: Arc<AllNodeConnectivities<TlsConnection, ()>>,
}

/// Implements MeshNetworkTransportReceiver.
/// This struct manages **incoming connections only** - it accepts TLS connections from all
/// peers and multiplexes their messages into a single channel. The application calls
/// `.receive()` to get the next message from any peer. Each incoming connection runs its own
/// background task that reads from the TLS stream, handles Ping/Pong packets, and forwards
/// MPC/IndexerHeight messages to the unified receiver channel.
///
/// Ping/Pong handling uses cross-stream communication to maintain unidirectional I/O: when
/// receiving a Ping, this handler sends Pong via the outgoing connection to that peer; when
/// receiving a Pong, it notifies the outgoing connection's keepalive task via a watch channel.
///
/// Implements [`MeshNetworkTransportReceiver`] to receive messages from all peers in the
/// mesh network.
pub struct TlsMeshReceiver {
/// Unified message queue receiving messages from all peers' incoming connections.
/// When any peer sends us a message, it gets queued here. The application calls
/// `.receive()` to dequeue the next message (which includes the sender's ID).
receiver: UnboundedReceiver<PeerMessage>,
/// Background task running the TCP acceptor loop on our listening port. It continuously
/// accepts incoming TCP connections and spawns a new task for each one that:
/// 1) Performs TLS handshake and authenticates the peer's identity
/// 2) Registers the connection with `connectivities` for bidirectional tracking
/// 3) Reads messages from the peer in a loop (read-only stream usage)
/// 4) On Ping: Sends Pong via our outgoing connection to maintain unidirectional I/O
/// 5) On Pong: Notifies the outgoing connection's keepalive task via watch channel
/// 6) Forwards MpcMessage and IndexerHeight to the unified `receiver` channel
///
/// The [`AutoAbortTask`] wrapper ensures automatic cleanup on drop.
_incoming_connections_task: AutoAbortTask<()>,
}

/// Maps public keys to participant IDs. Used to identify incoming connections.
/// Maps public keys to [`ParticipantId`]s for authenticating incoming connections.
///
/// This struct is populated at startup with the known public keys of all participants in the
/// network. When a peer establishes an incoming TLS connection, we extract their public key
/// from their TLS certificate and look it up in this map to determine their [`ParticipantId`].
/// This ensures that only known participants can connect, and we can correctly attribute
/// incoming messages to the right peer. If a connection presents an unknown public key, it is
/// rejected during the authentication phase.
Contributor


Appreciate you taking time to document all this :)


Successfully merging this pull request may close this issue: Implement echo connection check (#1647).