radvd: improve stale prefix cleanup #9261 #10261
max-foss wants to merge 1 commit into opnsense:master
<Default>1</Default>
</dns>
<MinRtrAdvInterval type="IntegerField">
<Default>200</Default>
I'm not going to accept this fundamental change. I also object to additional state files just to fix radvd behaviour.
Please reconsider your decision.
I suppose I'm asking for an argument and the reason why this PR does multiple things, which is not a good plan.
The radvd default
I strongly considered forking radvd, but it is pretty stateless in and of itself, and we would deliberately want the state to survive a reboot. Storing it in a file on disk is not my idea; the RFC refers to How/where would you be ok with it being stored the most? https://datatracker.ietf.org/doc/html/rfc9096#section-3.5 I have been dealing with the issue of improperly deprecated prefixes in OPNsense and the downstream consequences of that for 5 years now. I really like OPNsense the way it is and would like to see a fix happen. I am open to fully doing it your way, but please accept that a 24h PPPoE lifetime with dynamic IPv6 prefixes is a reality for many and that there is no workaround using the existing options. Devices tend to fall back to IPv4 via happy eyeballs or have intermittent connectivity issues, which means one can forget about running an IPv6-only network this way.
But wouldn't a change to 198 require a single line and a mildly good reason to do so? 200/600 have been the project defaults forever. But maybe I'm missing something vital. Regarding everything else: There was a lot of work done in 26.1 to add lifetimes to prefixes in ifconfig output. It should in theory throw out the bad prefix and advertise the new one? So my question is what is the current wrong config in What you're looking at is ISP->PPPoE->DHCPv6->Radvd and we're starting at the conclusion. Cheers, |
I can certainly split the two concerns into two PRs, a very reasonable requirement. 🙂
4fee2ec to 1aebf3c
What I formerly referred to as Part B, the possibility to have an empty I will not edit the posts above to avoid confusion. |
#10261 (comment) is largely still relevant before attempting to review |
Well, currently we get exactly one RA with the desired lifetime 0 for the now-deprecated prefix on shutdown/restart, which is easily missed by devices in standby on Wi-Fi, for example. This is expected to be insufficient and hence why RFC 9096 requires persisting that notice for at least the timespan of the previous lifetime. This PR is building upon the existing work, and literally queries ifconfig to use the exposed |
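To make the timing argument concrete, here is a minimal sketch of how long a zero-lifetime notice would have to be repeated, reading the comment above literally: at least as long as the lifetime that was last advertised for the prefix. The `StalePrefix` class, its field names and the example values are hypothetical and are not taken from the patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StalePrefix:
    """A prefix that was replaced and must be announced with lifetime 0."""

    prefix: str               # documentation prefix used for illustration
    deprecated_at: int        # Unix time at which the new prefix arrived
    last_valid_lifetime: int  # seconds, as last advertised to clients

    def announce_until(self) -> int:
        # Clients that missed every RA so far may still hold the old
        # prefix until the previously advertised lifetime has run out.
        return self.deprecated_at + self.last_valid_lifetime

    def still_needed(self, now: int) -> bool:
        return now < self.announce_until()


stale = StalePrefix("2001:db8:1::/64", deprecated_at=1_000_000,
                    last_valid_lifetime=86_400)
```

With these example values the notice would be kept for one day after the reconnect, so a device that wakes up hours later still sees it.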
Well, it’s an ok argument to say radvd behaviour is not sufficient. We often appear to be the “weak spot” where users pinned between stubborn ISPs and inflexible daemons ask for fundamental changes. And we have to be realistic in what we can offer. If you are using the new lifetime readings, why do we need a json file? I feel like most of this is over-complication. Cheers, |
The statefile is an essential part of the patch, because we want to clearly track which prefixes to announce as deprecated with 0 lifetime. It gets cleaned up automatically and is superior to in-memory storage, both because it survives daemon restarts and reboots and because it is less of a black box, without the overhead of creating additional tooling. Compared to even the default level of logging, it adds very little additional disk activity. That said, I could cut ~200 lines of code which tie the various lifetimes to the prefix lifetime, which is only a best practice. I wanted to uplift the entire radvd-"plugin" to perfection in one go; I can understand why you dislike the approach.
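For readers who want a picture of what such a self-cleaning state file could look like, the following is a rough sketch only. The JSON layout (a mapping from prefix to expiry timestamp) and all function names are assumptions made for illustration; the actual patch may store different fields.

```python
import json
import os
import tempfile


def load_state(path):
    """Read the state file; a missing or corrupt file means 'no state'."""
    try:
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def prune(state, now):
    """Drop entries whose announce-until timestamp has passed."""
    return {prefix: exp for prefix, exp in state.items() if exp > now}


def save_state(path, state):
    """Write atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".stale-", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2, sort_keys=True)
    os.replace(tmp, path)


def mark_stale(path, prefix, expires_at, now):
    """Prune expired entries, record a newly deprecated prefix, persist."""
    state = prune(load_state(path), now)
    state[prefix] = expires_at
    save_state(path, state)
    return state
```

Because pruning happens on every write, the file never grows beyond the prefixes that still need a zero-lifetime announcement, and it stays readable with any JSON viewer.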
Just a pragmatic question, what about this? That way clients will never receive the full lifetime of the prefix, but a reduced one that should reflect the current validity. I didn't check if radvd already has this configured or if it's valid in the OPNsense context.
One of the main challenges with 24-hour disconnects is that the ISP resets the DHCPv6 server and worst case reshuffles the SLAAC address, which makes stable connectivity a "downstream challenge". Basically they are guaranteeing lifetimes they are not going to uphold. Everything after that fact is less than "perfection" and it should be very clear we can't reach a fully satisfactory state in any case. Cheers, |
Just to share how it looks with this fix applied on macOS which is one of the two platforms causing the most pain without. If the deprecating RA with Unfortunately, macOS is stupid enough but not technically violating any RFC that it doesn't try the prefix with the highest And on the contrary, my Android phones dislike a short I understand that Windows and Desktop Linux users may not run into the ill effects of this problem, because their IPv6 stack is slightly saner. I am just trying to make OPNsense behave the same way as an out-of-the-box FRITZ!Box and freshly installed OpenWrt do. In Germany, Deutsche Telekom customers are affected much less often with 180 day+ lifetime and a disconnect often during the day when devices have a higher chance of not being present or not being asleep, while for me on both o2 and 1&1 DSL with a 24h auto-reconnect I tend to push to 4 a.m. via a cron-job, the issue is unfortunately a regular occurrence, no matter what. |
If the provider is broken, why shouldn't the provider fix it? Pretty sure they totally disregard RIPE recommendations about prefix assignments and also RFCs about it. But complaining to them (provider) doesn't do much.
Maybe they do by fudging RAs like proposed here, but these don't reach the downstream clients over DHCPv6. The whole idea of DHCPv6 PD delegation on top of link-local RA (including all routing) is a difficult one. |
You have a point here, there is a chance that the upstream values aren't sane anyway, in which case we would be gaining absolutely nothing from the additional logic. The minimal shape would be, starting with the new WAN IPv6 prefix event the previous/now-deprecated prefix gets advertised with Would you be more comfortable with aiming to keep the state in the radvd.conf, always pushing the last valid entry that is now gone to
In OPNsense we have a strong preference to the "primary" IPv6 address, which is now also the one with the longest remaining lifetime see c015b71 So this means:
One thing to keep in mind is that when adding more prefixes into radvd.conf they are going to force a restart due to the checksumming we do in order to avoid restarts when not necessary. The issue isn't as big perhaps and can be mitigated by a setting as proposed here, but in practice it would be nicer to do this without a setting as the harm of a well-laid out scope is practically zero. Cheers, |
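The restart concern can be illustrated with a small sketch: if the rendered configuration is compared by checksum, any extra prefix stanza changes the digest and therefore triggers a restart. The use of SHA-256, the function names and the interface name `igc1` are assumptions for illustration; the comment above does not say how OPNsense computes its checksum.

```python
import hashlib
from typing import Optional


def config_digest(text: str) -> str:
    """Return a hex digest of the rendered configuration text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_restart(previous_digest: Optional[str], new_config: str) -> bool:
    """Restart only when the rendered configuration actually changed."""
    return previous_digest != config_digest(new_config)


BASE = """interface igc1 {
    AdvSendAdvert on;
    prefix 2001:db8:2::/64 { };
};
"""

# Same configuration plus a deprecated prefix announced with lifetime 0.
WITH_STALE = BASE.replace(
    "    prefix 2001:db8:2::/64 { };\n",
    "    prefix 2001:db8:2::/64 { };\n"
    "    prefix 2001:db8:1::/64 {\n"
    "        AdvPreferredLifetime 0;\n"
    "        AdvValidLifetime 0;\n"
    "    };\n",
)
```

An unchanged render yields the same digest and no restart, while adding or later removing the stale stanza each counts as a change, which is the extra restart the comment warns about.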
It is the job of the CPE/router responsible for the delegated client /64 to advertise the fact that an older upstream prefix is no longer working after a new one is obtained. A PPPoE reconnect and DHCPv6-PD with a new prefix is a pretty normal and clear signal for that. I agree that dynamic IPv6 prefixes without PPPoE or another clear signal that the old prefix is no longer valid can be pretty messy depending on the exact ISP behaviour, but that is not what I am trying to address here. Anyway, I know for a fact that FRITZ!Box and various other plastic routers advertise the stale prefix as deprecated correctly, including vanilla OpenWrt, while that behaviour is unfortunately less common on enterprise gear where the support for dynamic IPv6 prefixes is pretty bad anyway, also in terms of firewall rules, unlike in OPNsense where such firewalling works pretty well already. |
Can you show how OpenWrt does it? I think it uses odhcpd as RA server, correct? There should be some hints in the code there how it handles that, or orchestrates it. Also issues like these sound interesting:
This sounds like we are finally getting somewhere, in such a case setting preferred lifetime However, in my case of a full-blown PPPoE-reconnect the old prefix is no longer attached to the interface at all, which is a strong signal we should advertise both Maybe we can come up with an elegant way to hook it into the natural flow, but I really prefer not to touch dhcp6c for this. |
I'm expecting the same here.
But it's ideal and we're already actively maintaining it. One thing I thought about is that instead of removing the prefixes we could set them to The other thing to keep in mind is whether we're setting lifetimes in radvd that extend over the end of the actual prefix lifetime. If we do that we should probably fix that first now that we can... |
Wouldn't |
Good idea on paper, but not helpful in practice for any of the cases I have seen. Currently, the inferred lifetime values from upstream/ifconfig are not synced into radvd at all, and doing so would require a similar, if not larger, patch surface than my initially rejected PR shape. Also, ISPs are not really providing sane values for either static or dynamic IPv6 PD. From 1&1 I get a preferred lifetime of 48h and a valid lifetime of 72h, which far exceeds the 24h reconnect forced on the PPPoE side of things and would be far too long to blindly mirror into RAs. The opposite case would also hurt: static or semi-static IPv6 prefixes distributed over DHCPv6 PD can have a much shorter lifetime than what they behave like in practice, unnecessarily putting certain devices at risk of losing IPv6 connectivity. My last two cents: regarding such German ISPs, we need to ask what a FRITZ!Box would do, which in this case is exactly the continued prefix lifetime 0 RAs we are missing here. Internal implementation details of that are proprietary and not publicly known, though. Making IPv6 work is a marathon, not a sprint, and OPNsense has come a long way; all of the patches I have seen look pretty good. Cheers Max
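The 1&1 numbers quoted above can be worked through as simple arithmetic. The sketch assumes, purely for illustration, that the upstream lifetimes are counted down from the moment the delegation was obtained; the helper name `remaining` is hypothetical.

```python
HOUR = 3600


def remaining(lifetime_s: int, elapsed_s: int) -> int:
    """Seconds left of a lifetime after some time has elapsed (never negative)."""
    return max(0, lifetime_s - elapsed_s)


preferred = 48 * HOUR   # preferred lifetime reported upstream
valid = 72 * HOUR       # valid lifetime reported upstream
reconnect = 24 * HOUR   # forced PPPoE reconnect interval

left_preferred = remaining(preferred, reconnect)
left_valid = remaining(valid, reconnect)
```

Even under this favourable assumption, a mirrored RA would leave the old prefix preferred for another 24 hours and valid for another 48 hours after the reconnect has already made it unusable.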
Answering to myself just for the record here's a POC of what I mean opnsense/dhcp6c@5245521a5341 |
It is currently only very indirectly touching the radvd path and my existing concerns are still somewhat standing. Let me summarise this into checklists. Minimal requirements:
Requirements for it to be actually decent:
Do you really intend to use the kernel as a state file replacement here? Such a prefix should not be used for anything else anyway, hence there is no reason to keep it around. How about going the OpenWrt route, keeping the state file on In general: Please let me know if there are any additional resources that could help you here? For example a GNS3 demo or more input from others like @thomasschaeferm, who is rather well known for his efforts around mitigating IPv6 pitfalls in various systems? Thanks Max |
Yes that would be opnsense/dhcp6c@5245521a5341 now it's in ifconfig until you reboot or vltime expires.
Sure, no objections.
Not needed with the proposal.
IMO this is scope creep for questionable benefit.
Sure, why not. It's there since 26.1.
ifconfig is pretty clear
why not
But you wanted to keep it around ;)
Not persistent across reboots and needs additional plumbing. Also you're blurring responsibility between dhcp6c/core/radvd by making it a joint effort. Prefix lifetime management strictly belongs to dhcp6c alone. Cheers, |
So, we explicitly skip the non-mandatory persistence across reboots and the state can be inspected using Do you still want to go without a toggle to turn it off, aligning our future behaviour with both OpenWrt not exposing a toggle and FRITZ!Box not exposing any configuration option at all? It looks like we finally have an idea about an acceptable shape. 🎉 I will have to familiarise myself with the dhcp6c code base and build process in the meantime, so don't expect a revision of this PR to be ready today. Cheers Max |
It can be added later through other means like dumping ifconfig output to a file but I don't expect it to be much needed. On system crashes we can't trust the output anyway. Normal scheduled reboots should not be as problematic (also due to doing clean shutdowns).
No option makes the most sense. Since this fixes edge cases you'd want this opt-out but I don't see much of a reason to do opt-out. In CARP setups extra care may have to be taken but we'll see. Here's how to install the proposal branch in a running OPNsense in under a minute unattended: (best use a reboot to activate the new binary, reconfigure won't restart it) Cheers, |
Important notices
Before you submit a pull request, we ask you kindly to acknowledge the following:
If AI was used, please disclose:
I take full responsibility for any and all LOC in this patch. I am happy with how it works, the shape of the code, the comments, and of course the small UI/UX tweaks.
Describe the problem
Part A: Clients losing IPv6 connectivity overnight on ISPs with nightly PPPoE reconnects. -> see #9261.
Part B: When I want to tweak RDNSS, Source Address or any other value under Services: Router Advertisements, I have to configure Minimum interval and Maximum interval and can't just leave them blank to trust the radvd defaults like I do when keeping Allow manual adjustment of DHCPv6 and Router Advertisements unchecked or with Default Lifetime, Preferred Lifetime and Valid Lifetime in the same menu.
Describe the proposed solution
Part A: We track the deprecated prefixes in /var/db/opnsense/radvd_stale_prefixes.json and template them in, following RFC 9096 and the other relevant RFCs to the letter.
Part B: We allow Minimum interval and Maximum interval to be kept blank, showing the value that will be used when not filled in as a greyed-out hint.
Related issue
Part A: #9261
Part B: I did not open a separate issue because it is a small, obvious and related change touching the same files.
GUI demos
Part A:

Part B:

Install guide (use at your own risk):
The changed files in this patch are fully compatible with OPNsense 26.1.7_1, which means we can just drop them in.
On the OPNsense:
pkg install rsync
mkdir backup_before_radvd_patch/
cd backup_before_radvd_patch/
cp -p /usr/local/etc/inc/plugins.inc.d/radvd.inc /usr/local/opnsense/mvc/app/controllers/OPNsense/Radvd/forms/dialogEntry.xml /usr/local/opnsense/mvc/app/models/OPNsense/Radvd/Radvd.php /usr/local/opnsense/mvc/app/models/OPNsense/Radvd/Radvd.xml /usr/local/opnsense/mvc/app/views/OPNsense/Radvd/settings.volt .
On your local machine, in a clone of my branch:
rsync -rlpt --relative --chown=root:wheel --chmod=F644,D755 src/./etc/inc/plugins.inc.d/radvd.inc src/./opnsense/mvc/app/controllers/OPNsense/Radvd/forms/dialogEntry.xml src/./opnsense/mvc/app/models/OPNsense/Radvd/Radvd.php src/./opnsense/mvc/app/models/OPNsense/Radvd/Radvd.xml src/./opnsense/mvc/app/views/OPNsense/Radvd/settings.volt root@opnsense.example:/usr/local/
On the OPNsense:
/usr/local/sbin/pluginctl -s radvd restart
I am running it right now and am happy to confirm correct behaviour on all fronts.
This can easily be validated after a PPPoE reconnect (and obtaining a new dynamic IPv6 prefix via DHCPv6) by checking /var/etc/radvd.conf on the OPNsense whether the previous prefixes are listed as AdvPreferredLifetime 0 and AdvValidLifetime 0 or by using tcpdump on either the OPNsense or any other device on the network.
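The manual check of radvd.conf described above could also be scripted. The following is a minimal sketch that scans configuration text for prefix stanzas carrying both zero lifetimes. The sample configuration, the interface name `igc1` and the helper names are made up for illustration, and the regular expressions only handle simple, comment-free stanzas rather than the full radvd.conf grammar.

```python
import re

# Matches "prefix <addr>/<len> { ... };" and captures address and body.
PREFIX_BLOCK = re.compile(r"\bprefix\s+(\S+)\s*\{(.*?)\}\s*;", re.DOTALL)


def option(body, name):
    """Return the value of a 'Name value;' option inside a stanza, if set."""
    match = re.search(rf"\b{name}\s+(\S+?)\s*;", body)
    return match.group(1) if match else None


def deprecated_prefixes(conf_text):
    """List prefixes advertised with preferred and valid lifetime 0."""
    found = []
    for prefix, body in PREFIX_BLOCK.findall(conf_text):
        if (option(body, "AdvPreferredLifetime") == "0"
                and option(body, "AdvValidLifetime") == "0"):
            found.append(prefix)
    return found


SAMPLE = """
interface igc1 {
    AdvSendAdvert on;
    prefix 2001:db8:2::/64 {
        AdvOnLink on;
        AdvAutonomous on;
    };
    prefix 2001:db8:1::/64 {
        AdvOnLink on;
        AdvAutonomous on;
        AdvPreferredLifetime 0;
        AdvValidLifetime 0;
    };
};
"""
```

Calling `deprecated_prefixes(SAMPLE)` returns `["2001:db8:1::/64"]`; on a real system the text would come from reading /var/etc/radvd.conf after the reconnect.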