Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply host DNS settings on peer state change #2291

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

hurricanehrndz
Copy link
Contributor

@hurricanehrndz hurricanehrndz commented Jul 19, 2024

Describe your changes

This is a PoC and a potential fix #2002. This patch is not full fleshed out and their are some unattended side effects, but I wanted to get your thoughts and opinions @mlsmaycon and @pascal-fischer and @lixmal.

More Info:

This essential tries to ensure that host dns changes aren't applied til at least one routing peer is connected and DNS servers are reachable. It does so by tying deactivation and reactivation of upstream servers to Peerlist state changes in the status recorder.

As it stands, now status -d doesn't accurately reflect the status on first connection. That is because I haven't figure out an effective method to deactivate the servers without more intense changes. Primarily because of how the deactivate and reactivate callbacks are structure and their dependence of the removeIndex. One potential workaround is to modify deactivate to skip applying host config, so that we can fix this side effect.

Thoughts and discussion please

Issue ticket number and link

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)
  • Extended the README / documentation, if necessary

@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch 2 times, most recently from ec6228a to 21f265a Compare July 19, 2024 14:58
@mlsmaycon mlsmaycon self-assigned this Aug 1, 2024
@mlsmaycon mlsmaycon requested a review from lixmal August 1, 2024 15:11
@lixmal
Copy link
Contributor

lixmal commented Aug 2, 2024

This essential tries to ensure that host dns changes aren't applied til at least one routing peer is connected and DNS servers are reachable.

This approach would break non-routed dns servers (e.g. 8.8.8.8 without exit node), correct?

@hurricanehrndz
Copy link
Contributor Author

hurricanehrndz commented Aug 7, 2024

This essential tries to ensure that host dns changes aren't applied til at least one routing peer is connected and DNS servers are reachable.

This approach would break non-routed dns servers (e.g. 8.8.8.8 without exit node), correct?

Not at all, it would just not apply settings until the client is connected to one peer. I am going to try and actually formulate this to a PR

I may update the activate and deactivate upstream handlers to take a flag to skip applying system settings so as fix the DNS updates from management.

@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from 21f265a to 09778a6 Compare August 15, 2024 21:38
@hurricanehrndz
Copy link
Contributor Author

@lixmal this is ready not sure if perhaps this should be gated under a config option and the default would be maintain the current behavior. Test and let me know what you think

@hurricanehrndz
Copy link
Contributor Author

hurricanehrndz commented Aug 16, 2024

Actually I got an idea to make a hybrid model, let me get that going instead

@LeszekBlazewski
Copy link

Looking forward for this fix.

If I could help in any way to test this (for example on a setup I described in #2002 (comment)) let me know.

@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from 09778a6 to 7b40359 Compare August 16, 2024 16:03
@hurricanehrndz
Copy link
Contributor Author

Okay so this will waitforresponse and trigger probes base when a peer connstatus changes

@hurricanehrndz
Copy link
Contributor Author

@lixmal and @mlsmaycon can we get a build so @LeszekBlazewski can provide some feedback

@lixmal
Copy link
Contributor

lixmal commented Aug 16, 2024

You can find the binaries in the artifacts in this PR: https://github.com/netbirdio/netbird/actions/runs/10422794281

@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from 508030c to 365f6b7 Compare August 19, 2024 14:11
@LeszekBlazewski
Copy link

Hey guys,

I have run some tests with the latest binaries built in this PR.

My setup:

  1. private R53 hosted zone with a configured inbound resolver (3 ips in range 10.3.X.X, only configured to match specific domains). Those 3 IP addresses are only accessible within VPC which runs a netbird peer. For the whole 10.3.0.0/16 range I have a network route configured which points all that traffic to the peer which runs inside that VPC and therefore is allowed to use the resolver.
  2. Imaginary DNS nameserver (ip 1.2.3.4, match all domains) that won't ever resolve
  3. Just for testing sake, the cloudflare and google DNS resolver (match all domains)
  4. Exit node setup

After comparing the current behaviour (netbird 0.28.7) vs the built binaries, I have found out that indeed the introduced changes postpone DNS servers evaluation until a routing peer is connected which does potentially address the issue mentioned in #2002 ... unless the nameserver is not supposed to be handled by that connected peer. Based on the logs, I have observed that nameserver probing is triggered regardless of the peer that got just connected ( I guess it's assumed that all configured netbird DNS servers can be handled by all connected peers or there are other technical limitations which prevent figuring out such information, as described below).

For example in my described setup, it would make sense to probe the nameservers from point 1 only if the routing peer which handles 10.3.X.X/16 would be connected. I am aware that such condition might be challenging to meet since there is no peer assignment for DNS resolution handling (only for distribution). Also the public DNS resolution (Google and Cloudflare) make the decision wether to probe / don't probe DNS even more harder since no one configures network routes for those DNS servers (there is a reason why they are public).

But still, I think that not applying the config when the DNS server is not reachable is key here and that seems to work. Moreover the backoff over time will fix the above mentioned issue and all the DNS servers who are reachable will be connected.

@hurricanehrndz
Copy link
Contributor Author

Hey guys,

I have run some tests with the latest binaries built in this PR.

My setup:

1. private R53 hosted zone with a configured inbound resolver (3 ips in range `10.3.X.X`, only configured to match specific domains). Those 3 IP addresses are only accessible within VPC which runs a netbird peer. For the whole 10.3.0.0/16 range I have a network route configured which points all that traffic to the peer which runs inside that VPC and therefore is allowed to use the resolver.

2. Imaginary DNS nameserver (ip 1.2.3.4, match all domains) that won't ever resolve

3. Just for testing sake, the cloudflare and google DNS resolver (match all domains)

4. Exit node setup

After comparing the current behaviour (netbird 0.28.7) vs the built binaries, I have found out that indeed the introduced changes postpone DNS servers evaluation until a routing peer is connected which does potentially address the issue mentioned in #2002 ... unless the nameserver is not supposed to be handled by that connected peer. Based on the logs, I have observed that nameserver probing is triggered regardless of the peer that got just connected ( I guess it's assumed that all configured netbird DNS servers can be handled by all connected peers or there are other technical limitations which prevent figuring out such information, as described below).

For example in my described setup, it would make sense to probe the nameservers from point 1 only if the routing peer which handles 10.3.X.X/16 would be connected. I am aware that such condition might be challenging to meet since there is no peer assignment for DNS resolution handling (only for distribution). Also the public DNS resolution (Google and Cloudflare) make the decision wether to probe / don't probe DNS even more harder since no one configures network routes for those DNS servers (there is a reason why they are public).

But still, I think that not applying the config when the DNS server is not reachable is key here and that seems to work. Moreover the backoff over time will fix the above mentioned issue and all the DNS servers who are reachable will be connected.

I made this patch to show what is one potential fix, probing base on routes when a peer connects will get more complicated, but it is possible.

This patch still includes the old logic as fallback, so triggering if public DNS servers are configured they should get apply very early on

@hurricanehrndz
Copy link
Contributor Author

Hey guys,

I have run some tests with the latest binaries built in this PR.

My setup:

1. private R53 hosted zone with a configured inbound resolver (3 ips in range `10.3.X.X`, only configured to match specific domains). Those 3 IP addresses are only accessible within VPC which runs a netbird peer. For the whole 10.3.0.0/16 range I have a network route configured which points all that traffic to the peer which runs inside that VPC and therefore is allowed to use the resolver.

2. Imaginary DNS nameserver (ip 1.2.3.4, match all domains) that won't ever resolve

3. Just for testing sake, the cloudflare and google DNS resolver (match all domains)

4. Exit node setup

After comparing the current behaviour (netbird 0.28.7) vs the built binaries, I have found out that indeed the introduced changes postpone DNS servers evaluation until a routing peer is connected which does potentially address the issue mentioned in #2002 ... unless the nameserver is not supposed to be handled by that connected peer. Based on the logs, I have observed that nameserver probing is triggered regardless of the peer that got just connected ( I guess it's assumed that all configured netbird DNS servers can be handled by all connected peers or there are other technical limitations which prevent figuring out such information, as described below).

For example in my described setup, it would make sense to probe the nameservers from point 1 only if the routing peer which handles 10.3.X.X/16 would be connected. I am aware that such condition might be challenging to meet since there is no peer assignment for DNS resolution handling (only for distribution). Also the public DNS resolution (Google and Cloudflare) make the decision wether to probe / don't probe DNS even more harder since no one configures network routes for those DNS servers (there is a reason why they are public).

But still, I think that not applying the config when the DNS server is not reachable is key here and that seems to work. Moreover the backoff over time will fix the above mentioned issue and all the DNS servers who are reachable will be connected.

I will try and add the ability to trigger a peer with route. What I gather though was that the App was performing and behaving better apply DNS changes base on when peers connect, is that right?

@LeszekBlazewski
Copy link

Yeah, for sure. With the patch from this PR the amount of timed out calls to private NS is minimised.

One thing I wanted to figure out and I am wondering if this could be related to my setup somehow is: Based on the logs, even after the peer which handles 10.3.X.X traffic connects (logs indicate that), the first few calls to probe the private NS still fail and only after few seconds they come back online. I will try to investigate this since I am confident that those NS are available at all times and it could be related to the exit node node or the routing peer connecting as relay.

@hurricanehrndz
Copy link
Contributor Author

Yeah, for sure. With the patch from this PR the amount of timed out calls to private NS is minimised.

One thing I wanted to figure out and I am wondering if this could be related to my setup somehow is: Based on the logs, even after the peer which handles 10.3.X.X traffic connects (logs indicate that), the first few calls to probe the private NS still fail and only after few seconds they come back online. I will try to investigate this since I am confident that those NS are available at all times and it could be related to the exit node node or the routing peer connecting as relay.

Do you have any system extensions that filter DNS requests, such as Cisco Umbrella? Assuming this is a macOS device

@LeszekBlazewski
Copy link

LeszekBlazewski commented Aug 21, 2024

The only thing running in the background is eset with the Network Access Protection enabled which might do some DNS filtering (I am not sure since I don't have access to the management of this antivirus but I was told it does not). I will make sure to disable it when testing.

What is interesting is that the NS requests fail only initially and then just work without me doing any changes.

@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from 365f6b7 to 2196243 Compare August 21, 2024 19:48
@hurricanehrndz
Copy link
Contributor Author

The only thing running in the background is eset with the Network Access Protection enabled which might do some DNS filtering (I am not sure since I don't have access to the management of this antivirus but I was told it does not). I will make sure to disable it when testing.

What is interesting is that the NS requests fail only initially and then just work without me doing any changes.

This patch now support enabling publicly accessible servers at processing the DNS update time and/or start

@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from aa5c94f to f7c8dc2 Compare August 22, 2024 18:10
@hurricanehrndz hurricanehrndz changed the title Poc dns on peer state change Apply host DNS settings on peer state change Aug 22, 2024
@hurricanehrndz
Copy link
Contributor Author

hurricanehrndz commented Aug 22, 2024

@LeszekBlazewski

Minor updates to behavior in order to ensure publicly accessible servers work from the start without a peer. I unfortunately didn't filter probing base on network. So test this one out. I will see what I can do about that front, but this would need to be more heavily refactored since I would need to start another network monitor

@LeszekBlazewski
Copy link

Indeed, did a quick test and the public nameservers were enabled almost instantly whereas the rest got processed later on. I am doing some checks on why the first initial few nameserver requests after the connection to the peer which is supposed to handle that traffic still fail.

@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from f7c8dc2 to 443ce47 Compare August 29, 2024 13:46
@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from 583f0c0 to 30e9cb7 Compare September 3, 2024 14:34
This patch adds an additional DNS probe trigger by ConnStatus changes to
any peer. This is an attempt to limit the number of DNS changes applied
to Darwin, Windows and BSD hosts. There should be no change whatsoever
on how iOS and Android operate.

How publicly accessible DNS servers get applied to the host should also
remain unchanged.
Verison 29+ introduce new functions for the built-in relay that notify
when peer state has changed. Similarly for this feature to work, there
are handlers that need to close the aPeerConnStatusChanged channel to
signal a change.
@hurricanehrndz hurricanehrndz force-pushed the poc_DNS_on_peer_state_change branch from 30e9cb7 to 814369b Compare September 23, 2024 14:58
@hurricanehrndz
Copy link
Contributor Author

@LeszekBlazewski can you test the latest build, the patch has been updated to work with the latest 29.x releases. I believe @mlsmaycon will be reviewing this soon

https://github.com/netbirdio/netbird/actions/runs/10997072510?pr=2291

@LeszekBlazewski
Copy link

LeszekBlazewski commented Sep 23, 2024 via email

Copy link

sonarqubecloud bot commented Oct 1, 2024

…te_change

* upstream/main: (81 commits)
  Fix cached device flow oauth (netbirdio#2833)
  Avoid failing all other matrix tests if one fails (netbirdio#2839)
  add all group to add peer affected peers network map check (netbirdio#2830)
  [client] Log windows panics (netbirdio#2829)
  Fix unused servers cleanup (netbirdio#2826)
  [management] Add DB access duration to logs for context cancel (netbirdio#2781)
  Allocate new buffer for every package (netbirdio#2823)
  [client] Nil check on ICE remote conn (netbirdio#2806)
  [management] remove network map diff calculations (netbirdio#2820)
  Create FUNDING.yml (netbirdio#2814)
  Create funding.json (netbirdio#2813)
  [management] add metrics to network map diff (netbirdio#2811)
  [client] Fix the broken dependency gvisor.dev/gvisor (netbirdio#2789)
  fix meta is equal slices (netbirdio#2807)
  [client] Fix multiple peer name filtering in netbird status command (netbirdio#2798)
  [management] Setup key improvements (netbirdio#2775)
  [client] allow relay leader on iOS (netbirdio#2795)
  [client] Remove legacy forwarding rules in userspace mode (netbirdio#2782)
  [client] Ignore route rules with no sources instead of erroring out (netbirdio#2786)
  [misc] Update Zitadel from v2.54.10 to v2.64.1
  ...
@hurricanehrndz
Copy link
Contributor Author

@LeszekBlazewski this now applies settings base on routing table

@Eu-Arthur
Copy link

i have same problem, like #2002, and this PR fix my problem.

But, it's possible to be rebase to last version ?

@hurricanehrndz
Copy link
Contributor Author

hurricanehrndz commented Nov 29, 2024 via email

…te_change

* upstream/main: (55 commits)
  [client] Account different policiy rules for routes firewall rules (netbirdio#2939)
  Add guide when signing key is not found (netbirdio#2942)
  [tests] Enable benchmark tests on github actions (netbirdio#2961)
  [management] Add performance test for login and sync calls (netbirdio#2960)
  [management] refactor to use account object instead of separate db calls for peer update (netbirdio#2957)
  [client] Code cleaning in net pkg and fix exit node feature on Android(netbirdio#2932)
  [management] Refactor nameserver groups to use store methods (netbirdio#2888)
  [management] Refactor DNS settings to use store methods (netbirdio#2883)
  [management] Refactor policy to use store methods (netbirdio#2878)
  [management] Refactor posture check to use store methods (netbirdio#2874)
  [client] Allow routing to fallback to exclusion routes if rules are not supported (netbirdio#2909)
  [client] Set up sysctl and routing table name only if routing rules are available (netbirdio#2933)
  [client] Test nftables for incompatible iptables rules (netbirdio#2948)
  [client] Don't return error in userspace mode without firewall (netbirdio#2924)
  Import time package (netbirdio#2940)
  [misc] Renew slack link (netbirdio#2938)
  [relay] Refactor initial Relay connection (netbirdio#2800)
  [management] Fix getSetupKey call (netbirdio#2927)
  [client] Fix allow netbird rule verdict (netbirdio#2925)
  [management] Add activity events to group propagation flow (netbirdio#2916)
  ...
@hurricanehrndz
Copy link
Contributor Author

i have same problem, like #2002, and this PR fix my problem.

But, it's possible to be rebase to last version ?

I have updated this PR, but it is a WIP, please test. I need update this PR to support other platforms

@Eu-Arthur
Copy link

Yes, i don't see problem, for me that work on latest version

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Netbird should connect to peers before setting up DNS
5 participants