
multi: refresh peer IP during reconnect #5538

Merged: 9 commits, Oct 4, 2021

Conversation

@ellemouton (Collaborator) commented Jul 17, 2021

Fixes #5377

With this PR, we ensure that when a persistent outbound peer changes its IP address we can reconnect to it. This is done by periodically fetching the peer's latest advertised IP address from the DB and updating the connection request to the peer accordingly.

A new itest, reconnect_after_ip_change, demonstrates the original bug and shows how this PR fixes it.
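As a rough illustration of the mechanism, here is a minimal, self-contained Go sketch of refreshing a peer's advertised addresses before each reconnection attempt. All names here (fetchAdvertisedAddrs, reconnectLoop) are illustrative stand-ins, not lnd's actual API:

package main

import (
    "fmt"
    "time"
)

// fetchAdvertisedAddrs stands in for a graph-DB lookup of the peer's
// most recently advertised addresses (hypothetical helper, not lnd's API).
func fetchAdvertisedAddrs(pubKey string) []string {
    return []string{"192.0.2.1:9735"}
}

// reconnectLoop sketches the fix: rather than reusing the address stored
// when the connection request was first created, re-read the advertised
// addresses before every attempt so an IP change is picked up.
func reconnectLoop(pubKey string, backoff time.Duration, maxAttempts int) {
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        time.Sleep(backoff)

        // Refresh the candidate addresses on every attempt.
        for _, addr := range fetchAdvertisedAddrs(pubKey) {
            fmt.Printf("attempt %d: dialing %s at %s\n", attempt, pubKey, addr)
            // A real implementation would hand the refreshed addresses
            // to the connection manager here.
        }
    }
}

func main() {
    reconnectLoop("03aa...", 10*time.Millisecond, 2)
}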

@Crypt-iQ Crypt-iQ added bug fix networking p2p Code related to the peer-to-peer behaviour labels Jul 19, 2021
@Roasbeef Roasbeef added this to the v0.14.0 milestone Jul 19, 2021
@ellemouton ellemouton changed the title from "mutli: refresh peer IP during reconnect" to "multi: refresh peer IP during reconnect" Jul 20, 2021
@Roasbeef Roasbeef requested review from yyforyongyu and bhandras July 20, 2021 17:39
server.go Outdated
s.mu.Lock()
defer s.mu.Unlock()

// Check if there are currently any connection requests for this node.
Collaborator:
Something I don't fully understand: isn't this inherently a race? As in, if the address changes, we only ever connect to the new address if we happen to be right in the middle of connecting to the node anyway?

Collaborator (author):
Ah ok, yeah. Is your question basically 'what happens if we somehow get the node announcement before we run peerTerminationWatcher'?

For some reason I'm having a hard time recreating this situation in the itest (I can't seem to force the NodeAnnouncement to be received before peerTerminationWatcher has run, no matter where I add delays 🤔).

To address this though, what do you think about:

  1. removing the if p.Inbound() here so that the latest addresses are fetched from the node announcement whether the peer was inbound or outbound
  2. also then fetching the latest advertised addresses again after the backoff here

We would still need to keep the UpdateConnReqs, though, for the case where the NodeAnnouncement is received after peerTerminationWatcher has run.

Member:
removing the if p.Inbound() here so that the latest addresses are fetched from the node announcement whether the peer was inbound or outbound
also then fetching the latest advertised addresses again after the backoff here

Sounds good to me.

We would still need to keep the UpdateConnReqs, though, for the case where the NodeAnnouncement is received after peerTerminationWatcher has run

Not sure that I follow. What is the problem if the NodeAnnouncement is received after peerTerminationWatcher?

@ellemouton (Collaborator, author) commented Jul 26, 2021:
peerTerminationWatcher creates the connection requests that will be used to reconnect to the peer, using the currently stored addresses for that node. If the peer changes its address, but we only get the NodeAnnouncement that includes the new addresses after we have run peerTerminationWatcher, then the connection requests we created will have the incorrect addresses.

Member:
I see. Then I think we need to fetch the node info from db again to get the latest address.

Collaborator (author):
Well yes, but those latest addresses could still be old; we could get a NodeAnnouncement after we have done that. That is why I added the r.cfg.UpdateConnRequests(pk, msg.Addresses) call above: if we do get a NodeAnnouncement after we have already fetched from the db and created connReqs, then things can still be updated. I will also add a refresh from the db in peerTerminationWatcher, but I don't think I can remove the UpdateConnRequests. Thoughts?

Member:
I will also add a refresh from the db in peerTerminationWatcher

Yeah, I think this would be good enough. As suggested in this comment, we should handle persistent connections in their own dedicated scope. Down at the bottom, connmgr handles the retries, but it does not refresh the address. Since peerTerminationWatcher is a wrapper above connmgr, maybe we could put some logic there to retrieve the updated address?

Unrelated to this PR, I think we need to someday extract all the connection management into its own package, saving us from the long server.go file, putting up some nice unit tests, etc.

Collaborator (author):
I have updated the PR and the PR description to explain why I think it is not enough to just refresh from the db in peerTerminationWatcher (the itest also fails if we only do this). Let me know what you think.

@ellemouton (Collaborator, author) left a comment:
Thanks for the review @bhandras :) I have fixed the nits and added a question to one of your comments.

@ellemouton ellemouton force-pushed the refreshPeerIP branch 2 times, most recently from 8ca183c to def0707 Compare July 22, 2021 09:12
@yyforyongyu (Member) left a comment:
I think modifying peerTerminationWatcher to fetch the latest net addr is good enough. In essence, a peer changing its IP is equivalent to a peer going offline then online. We need to properly clean up the old state (disconnect the peer, remove the links, etc.), then attempt to reconnect using the new IP (which also means updating state like the address in brontide's config). peerTerminationWatcher has almost done everything already, except fetching the updated net addr.
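A minimal sketch of the offline-then-online flow described above, with toy stand-in types (peerState and lookupLatestAddr are assumptions, not lnd's types): tear down the old state first, then reconnect with a freshly fetched address.

package main

import "fmt"

// peerState is a toy stand-in for the per-peer state a node tracks; the
// fields and steps mirror the comment's description, not lnd's types.
type peerState struct {
    pubKey string
    addr   string
    links  []string
}

// lookupLatestAddr stands in for fetching the node's latest advertised
// address from the graph DB (hypothetical helper).
func lookupLatestAddr(pubKey string) string {
    return "198.51.100.7:9735"
}

// handlePeerOffline sketches the order of operations: clean up the old
// state first, then reconnect using a freshly fetched address rather
// than the one stored at connect time.
func handlePeerOffline(p *peerState) {
    // 1. Clean up old state (remove links, etc.).
    p.links = nil
    fmt.Println("removed links for", p.pubKey)

    // 2. Refresh the net address before reconnecting.
    p.addr = lookupLatestAddr(p.pubKey)

    // 3. Attempt the reconnect with the updated address.
    fmt.Println("reconnecting to", p.pubKey, "at", p.addr)
}

func main() {
    handlePeerOffline(&peerState{pubKey: "03bb...", addr: "192.0.2.1:9735"})
}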

// assertNumConnections asserts number current connections between two peers.
func assertNumConnections(t *harnessTest, alice, bob *lntest.HarnessNode,
expected int) {
// assertConnected asserts that two peers are connected.
Member:
Nice refactor!

// that this bug exists and will be changed to assert that peers are able to
// reconnect in the commit that fixes the bug.
func testReconnectAfterIPChange(net *lntest.NetworkHarness, t *harnessTest) {
// In this test, the following network will be set up. A single
Member:
Very nice docs!

lntest.OpenChannelParams{
Amt: 1000000,
},
)
Member:
nit: we could defer closeChannelAndAssert(ctxt, t, net, net.Alice, chanPoint, false) here for clarity, similar to how we defer shutdownAndAssert(net, t, charlie).

// that Bob will send below after changing his listening port will be
// younger than the timestamp Bob's NodeAnnouncement sent after the
// channel open above.
time.Sleep(time.Second)
Member:
Sorry, not sure that I follow, but why is this needed? 🧐

Collaborator (author):
Seems that if two NodeAnnouncements about the same node are received in the same second, then the second one is ignored, as the node thinks it has already received it.

Collaborator (author):
i.e. I think the timestamp resolution of the NodeAnnouncement is 1 second, but I will double check 👍
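A tiny sketch of why the one-second sleep matters, assuming (as suspected above) that announcement timestamps have one-second resolution, so an announcement that is not strictly newer than the stored one is dropped as stale:

package main

import "fmt"

// isNewerAnnouncement mirrors the suspected dedup rule: an announcement
// whose timestamp is not strictly newer than the stored one is ignored.
func isNewerAnnouncement(stored, incoming uint32) bool {
    return incoming > stored
}

func main() {
    // Two announcements within the same second carry equal timestamps,
    // so the second one is ignored.
    fmt.Println(isNewerAnnouncement(1625000000, 1625000000)) // false
    fmt.Println(isNewerAnnouncement(1625000000, 1625000001)) // true
}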

// If we currently have any connection requests pending for
// this node then update the requests if the node addresses
// have changed.
r.cfg.UpdateConnRequests(pk, msg.Addresses)
Member:
Not sure if the router should be handling this, as it's already saving the node info to the db. Some other service could just read the info from the db and apply the necessary changes there.

Collaborator (author):
Related to my comment below: yes, sure, another service could read from the DB, but the issue is that the other service would not know when to refresh from the db.

Member:
Though we haven't strictly defined the overall architecture, the routing package should not be worrying about net.Addr; its main purpose is to route HTLCs over the link layer. If other services don't know when to refresh the db, we could apply a publish-subscribe pattern or, in this case, might just periodically refresh the db.

Collaborator (author):
Cool, I have moved this to the gossiper package.

func assertNumConnections(t *harnessTest, alice, bob *lntest.HarnessNode,
expected int) {
// assertConnected asserts that two peers are connected.
func assertConnected(t *harnessTest, alice, bob *lntest.HarnessNode,
Member:
A second thought on this function. It appears to me that this function only asserts that alice is connected to bob, but not the other way around? We might have bob listed in alice's call of ListPeers but no alice in bob's. So we need to list peers for bob too like the original func.
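A sketch of what such a bidirectional check could look like, assuming simplified stand-ins for the harness API (node and waitMutuallyConnected are illustrative, not the itest code):

package main

import (
    "fmt"
    "time"
)

// node is a toy harness node; peers holds the names this node currently
// lists as connected (a stand-in for the ListPeers RPC).
type node struct {
    name  string
    peers map[string]bool
}

// waitMutuallyConnected checks the connection in both directions, per
// the review comment: bob must appear in alice's peer list AND alice in
// bob's, since the two views can briefly disagree. It polls until both
// agree or the deadline passes.
func waitMutuallyConnected(alice, bob *node, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        if alice.peers[bob.name] && bob.peers[alice.name] {
            return nil
        }
        time.Sleep(10 * time.Millisecond)
    }
    return fmt.Errorf("%s and %s are not mutually connected",
        alice.name, bob.name)
}

func main() {
    alice := &node{name: "alice", peers: map[string]bool{"bob": true}}
    bob := &node{name: "bob", peers: map[string]bool{"alice": true}}
    fmt.Println(waitMutuallyConnected(alice, bob, time.Second)) // <nil>
}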

@ellemouton ellemouton force-pushed the refreshPeerIP branch 2 times, most recently from 23b4993 to 2f3e5b4 Compare August 6, 2021 14:08
@ellemouton (Collaborator, author):

Thanks for the review @yyforyongyu and @bhandras :) I have updated the PR and have written a much more detailed PR description. Please let me know what you guys think :)

@yyforyongyu (Member) left a comment:
Thanks for the review @yyforyongyu and @bhandras :) I have updated the PR and have written a much more detailed PR description. Please let me know what you guys think :)

Cool, thanks for the update! Left some comments, and I'm thinking maybe it's easier to just put a loop around the connection retry inside peerTerminationWatcher to avoid the potential race; then there's no need to worry about when the NodeAnnouncement arrives, etc.

current DNS seeds when in SigNet
mode](https://github.com/lightningnetwork/lnd/pull/5564).

* [A bug has been fixed that would result in nodes not reconnecting to their
persistent outbound peers if the peer's IP
address changed](https://github.com/lightningnetwork/lnd/pull/5538)
Member:
nit: missing a dot at the end

// If we currently have any connection requests pending for
// this node then update the requests if the node addresses
// have changed.
d.cfg.UpdateConnRequests(pk, msg.Addresses)
Member:
Could msg.Addresses be empty? Do we need to call UpdateConnRequests for every node announcement?
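For illustration, a minimal sketch of the guard this comment suggests, with a stand-in callback (maybeUpdateConnRequests is hypothetical, not the gossiper's code): skip the update entirely when the announcement carries no addresses, so existing connection requests aren't replaced with an empty list.

package main

import "fmt"

// maybeUpdateConnRequests only forwards the update when the announcement
// actually advertises addresses. The callback stands in for the real
// UpdateConnRequests hook.
func maybeUpdateConnRequests(pubKey string, addrs []string,
    update func(string, []string)) {

    if len(addrs) == 0 {
        return // nothing useful to update with
    }
    update(pubKey, addrs)
}

func main() {
    update := func(pk string, addrs []string) {
        fmt.Println("updating", pk, "with", addrs)
    }
    maybeUpdateConnRequests("03cc...", nil, update)                        // skipped
    maybeUpdateConnRequests("03cc...", []string{"192.0.2.9:9735"}, update) // applied
}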

server.go Outdated
// addresses and update the connection request
// accordingly.
advertisedAddr, err := s.fetchNodeAdvertisedAddr(pubKey)
if err == nil {
Member:
We need to log the error here if it's not nil.

server.go Outdated
@@ -3564,6 +3564,14 @@ func (s *server) peerTerminationWatcher(p *peer.Brontide, ready chan struct{}) {

select {
case <-time.After(backoff):
// Once again attempt to refresh the node
Member:
Could we put the whole select inside a for loop, so that we refresh the db to fetch the latest address after every backoff? Seems like a simpler change.

Collaborator (author):
ok cool yeah, good idea 👍
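A minimal sketch of this suggestion, assuming a stand-in address lookup (latestAddr and retryWithRefresh are illustrative names): the backoff select sits inside a for loop so the address is re-fetched after every wait, instead of only once before the retries begin.

package main

import (
    "fmt"
    "time"
)

// latestAddr stands in for refreshing the node's advertised address
// from the DB (hypothetical helper).
func latestAddr(pubKey string) string {
    return "203.0.113.4:9735"
}

// retryWithRefresh wraps the backoff select in a for loop so a recently
// announced IP change is picked up before the next attempt.
func retryWithRefresh(pubKey string, quit <-chan struct{}) {
    backoff := 10 * time.Millisecond

    for attempt := 0; attempt < 3; attempt++ {
        select {
        case <-time.After(backoff):
            // Refresh the address after each backoff period so the
            // freshest advertised address is used on this attempt.
            addr := latestAddr(pubKey)
            fmt.Printf("attempt %d: connecting to %s\n", attempt, addr)

        case <-quit:
            return
        }

        backoff *= 2 // exponential backoff between attempts
    }
}

func main() {
    retryWithRefresh("03dd...", make(chan struct{}))
}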

server.go Outdated
return
}

connSet := make(map[string]bool)
Member:
nit: add some docs to explain what this is for.
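For illustration, one plausible reading of what a set like connSet is for (an assumption based on the surrounding discussion, not a copy of lnd's code): tracking which addresses already have a pending connection request so duplicates aren't launched.

package main

import "fmt"

// dedupeAddrs keeps only the candidate addresses that do not already
// have a pending connection request.
func dedupeAddrs(pending, candidates []string) []string {
    // connSet records every address we already have a request for.
    connSet := make(map[string]bool)
    for _, addr := range pending {
        connSet[addr] = true
    }

    // Only keep candidate addresses that are not yet covered.
    var fresh []string
    for _, addr := range candidates {
        if !connSet[addr] {
            fresh = append(fresh, addr)
        }
    }
    return fresh
}

func main() {
    pending := []string{"192.0.2.1:9735"}
    candidates := []string{"192.0.2.1:9735", "198.51.100.2:9735"}
    fmt.Println(dedupeAddrs(pending, candidates)) // [198.51.100.2:9735]
}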

server.go Outdated
Permanent: true,
}

s.persistentConnReqs[pubStr] = append(
Member:
What about s.persistentPeersBackoff and s.persistentRetryCancels, do we need to update them too?

Collaborator (author):
The latest change does update s.persistentPeersBackoff, but I don't think s.persistentRetryCancels needs to be updated.

@ellemouton (Collaborator, author) left a comment:
Thanks for the idea @yyforyongyu :) I have updated 👍

@bhandras (Collaborator) left a comment:
Changes LGTM, much cleaner in the latest iteration. Great work @ellemouton! 🎉 (Left a few optional nits to consider.)

I think the PR is ready to be merged, but could you please check whether any of those itest flakes are related?

)
}
if len(bNumPeers.Peers) != expected {

for _, peer := range peers.Peers {
Collaborator:
nit: could use less copy-paste by using closures, but since it's just test code it's not super important.

Collaborator (author):
good idea, updated 👍
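A sketch of the closure idea from the nit above, with simplified stand-in types (node and assertMutuallyConnected are illustrative): one closure holds the shared "does a see b" logic and both directions reuse it.

package main

import (
    "errors"
    "fmt"
)

// node is a simplified stand-in for a harness node.
type node struct {
    name  string
    peers map[string]bool
}

// assertMutuallyConnected factors the shared check into a closure and
// calls it in both directions, avoiding the copy-pasted blocks.
func assertMutuallyConnected(a, b *node) error {
    // One closure holds the shared logic; both directions reuse it.
    sees := func(from, to *node) error {
        if !from.peers[to.name] {
            return errors.New(from.name + " does not see " + to.name)
        }
        return nil
    }

    if err := sees(a, b); err != nil {
        return err
    }
    return sees(b, a)
}

func main() {
    a := &node{name: "alice", peers: map[string]bool{"bob": true}}
    b := &node{name: "bob", peers: map[string]bool{"alice": true}}
    fmt.Println(assertMutuallyConnected(a, b)) // <nil>
}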

@@ -671,7 +671,7 @@ func (n *NetworkHarness) EnsureConnected(ctx context.Context,
// NOTE: This function may block for up to 15-seconds as it will not return
// until the new connection is detected as being known to both nodes.
func (n *NetworkHarness) ConnectNodes(ctx context.Context, t *testing.T,
a, b *HarnessNode) {
a, b *HarnessNode, perm bool) {
Collaborator:
nit: (no need to change) alternatively we could have ConnectNodes(...) and ConnectNodesPerm(...) both calling to connectNodes(..., perm bool).
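A sketch of the suggested API shape; the function bodies are stand-ins for the real harness logic. Two exported wrappers share one unexported implementation, instead of a bool parameter on the public method:

package main

import "fmt"

// ConnectNodes creates a non-persistent connection between two nodes.
func ConnectNodes(a, b string) {
    connectNodes(a, b, false)
}

// ConnectNodesPerm creates a persistent connection between two nodes.
func ConnectNodesPerm(a, b string) {
    connectNodes(a, b, true)
}

// connectNodes holds the shared implementation; perm controls whether
// the connection is marked persistent.
func connectNodes(a, b string, perm bool) {
    fmt.Printf("connecting %s -> %s (perm=%v)\n", a, b, perm)
}

func main() {
    ConnectNodes("alice", "bob")
    ConnectNodesPerm("alice", "charlie")
}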

},
)
defer func() {
ctxt, _ = context.WithTimeout(ctxb, channelCloseTimeout)
Collaborator:
nit: do you need this context?

@ellemouton (Collaborator, author):

@Roasbeef, @yyforyongyu & @bhandras: I have created #5700 as an alternative to this PR. It uses an event-driven approach.

@ellemouton (Collaborator, author):

ok, putting #5700 (big refactor) on the back burner again until 0.15 and revamping this PR instead for now.

server.go Outdated
}
// We'll only need to re-launch a connection request if one
// isn't already currently pending.
if _, ok := s.persistentConnReqs[pubStr]; ok {
Member:
Solid change here, eliminates a bunch of nesting 👍

@Roasbeef (Member) left a comment:

LGTM 💎

One small comment, but looking really good other than that. Excited to proceed w/ that greater refactor in the next major version, but pleased with how this PR progressed in either case.

@Roasbeef (Member) commented Oct 1, 2021

Needs a rebase!

@ellemouton ellemouton force-pushed the refreshPeerIP branch 2 times, most recently from 9c0b0c1 to 4dcc866 Compare October 1, 2021 06:48
The assertNumConnections function currently takes in an 'expected' number of connections argument and asserts that alice and bob each have exactly that number of connections. This fails to be useful if, say, alice is also connected to charlie: calling assertNumConnections between alice and bob will then fail, reporting two connections between them, since all it does is count alice's total number of connections. This commit replaces the function with two new functions: assertConnected, which asserts that at least one connection exists between two peers, and assertNotConnected, which asserts that no connection exists between the two peers.
This commit adds a ConnectNodesPerm function to the itest NetworkHarness
so that persistent connections between nodes can be mocked.
This commit adds an itest to demonstrate that if a peer advertises multiple external IP addresses, then not all of them will be used when attempting to reconnect to the peer. This will be fixed in a follow-up commit.
The point of this commit is to make future commits in the same PR easier to review. All that this commit does is exit early if the peer we are considering is not persistent, instead of having a bunch of logic indented in an if-clause.
This commit just ensures that we fetch the latest advertised addresses for a peer from the db for both inbound and outbound peers. The reason for separating this into its own commit is to make future commits in this PR easier to review.
In this commit, all advertised addresses of a peer are used during
reconnection. This fixes a bug previously demonstrated in an itest.
In this commit we demonstrate a bug: if an inbound peer changes their listening address to one not advertised in their original NodeAnnouncement, then we will not be able to reconnect to them. This bug will be fixed in a follow-up commit.
In this commit, a subscription is made to topology updates. For any NodeAnnouncements received for our persistent peers, we store their newly advertised addresses. If at the time of receiving these new addresses there are any existing connection requests, they are updated to reflect the newly advertised addresses.
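A minimal sketch of this final design, with toy types (nodeUpdate, watchTopology, and the callbacks are illustrative stand-ins, not lnd's gossiper API): subscribe to topology updates and, for announcements about persistent peers, patch any outstanding connection requests with the new addresses.

package main

import "fmt"

// nodeUpdate is a toy version of a topology update for a node
// announcement; the real lnd type carries more fields.
type nodeUpdate struct {
    pubKey string
    addrs  []string
}

// watchTopology consumes topology updates and, for persistent peers,
// forwards the new addresses to the connection-request updater.
func watchTopology(updates <-chan nodeUpdate, persistentPeers map[string]bool,
    updateConnReqs func(string, []string)) {

    for update := range updates {
        // Only persistent peers need their reconnect state refreshed.
        if !persistentPeers[update.pubKey] {
            continue
        }

        // Patch any pending connection requests with the new addresses.
        updateConnReqs(update.pubKey, update.addrs)
    }
}

func main() {
    updates := make(chan nodeUpdate, 1)
    updates <- nodeUpdate{pubKey: "03ee...", addrs: []string{"198.51.100.9:9735"}}
    close(updates)

    watchTopology(updates, map[string]bool{"03ee...": true},
        func(pk string, addrs []string) {
            fmt.Println("updated conn reqs for", pk, "->", addrs)
        })
}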
@viaj3ro commented Jan 2, 2022

Has this been thoroughly tested? I had another IP address change on my node and around 25 of my Tor peers are still disconnected after 40 hours. The new IP was announced right after the change and 1ml picked it up around 15-30 minutes later.

Labels: bug fix, networking, p2p (code related to the peer-to-peer behaviour)
Projects: none yet
Development: successfully merging this pull request may close the issue "Peers are unable to reconnect and stuck in a deadlock after IP address change"
8 participants