Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sfptpd fails to recover on FPGA devices restart, leading to service crash #17

Open
sasharozenson opened this issue Oct 25, 2024 · 4 comments

Comments

@sasharozenson
Copy link

sasharozenson commented Oct 25, 2024

Environment

  • OS: Debian 11 ("Bullseye")
  • Kernel: 6.1.60-2
  • sfptpd Version: 3.7

The problem

During FPGA NIC flashing, sfptpd sometimes fails to reinitialize its PTP clock synchronization, resulting in a service crash. This happens sporadically, as sfptpd may recover successfully after some flashing sessions while failing in others.
Note: PTP configuration is being applied to a Solarflare device, not one of the restarted FPGA devices.

Expected behavior

sfptpd should detect NIC hotplug events, reinitialize the PHC, and resume clock synchronization after NIC flashing without requiring a manual restart.


Reproduction steps

  1. Flash FPGA devices (~ 2 minutes).
  2. Monitor sfptpd logs during and after the flashing process.
  3. Check for clock-related errors or assertion failures.

Logs

Oct 24 11:30:01 hostname sfptpd[645550]: interface: handling detected changes: neta (if_index 1270)
Oct 24 11:30:01 hostname sfptpd[645550]: interface: handling detected changes: netb (if_index 1271)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: primnic (if_index 1272)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: bcknic (if_index 1273)
Oct 24 11:30:02 hostname sfptpd[645550]: clock phc3(neta/netb/primnic/bcknic/fnic1_2/fnic1_3): failed to adjust frequency using clock_adjtime(), No such device
Oct 24 11:30:02 hostname sfptpd[645550]: clock phc3(neta/netb/primnic/bcknic/fnic1_2/fnic1_3): error getting system time, No such device
Oct 24 11:30:02 hostname sfptpd[645550]: failed to read clock phc3(neta/netb/primnic/bcknic/fnic1_2/fnic1_3) time, No such device
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: fnic1_2 (if_index 1274)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: fnic1_3 (if_index 1275)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: hotplug remove for fnic0_0 (1268)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: hotplug remove for fnic0_1 (1269)
Oct 24 11:30:04 hostname sfptpd[645550]: no clock frequency correction file /var/lib/sfptpd/freq-correction-0000:0000:0000:0000
Oct 24 11:30:04 hostname sfptpd[645550]: clock (deleted): unable to adjust frequency - no phc device
Oct 24 11:30:04 hostname sfptpd[645550]: clock (deleted): unable to get time - no phc device
Oct 24 11:30:04 hostname sfptpd[645550]: failed to read clock (deleted) time, No such device
Oct 24 11:30:04 hostname sfptpd[645550]: phc3: using diff method SYS_OFFSET
Oct 24 11:30:04 hostname sfptpd[645550]: clock phc3: applying offset 1729745704.394490016 seconds
Oct 24 11:30:04 hostname sfptpd[645550]: **sfptpd: src/sfptpd_phc.c:1327: sfptpd_phc_record_step: Assertion `phc != ((void *)0)' failed.**
Oct 24 11:30:04 hostname systemd[1]: sfptpd.service: Main process exited, code=killed, status=6/ABRT
Oct 24 11:30:04 hostname systemd[1]: sfptpd.service: Failed with result 'signal'.

Observations

The failure is linked to sfptpd's handling of PHC device removals and re-insertions during device flashing. The assertion failure "sfptpd_phc_record_step: Assertion 'phc != ((void *)0)' failed" suggests sfptpd references an uninitialized or null PHC device upon attempting to adjust the clock.

Temporary Workaround

Restarting sfptpd after flashing typically resolves the issue, though this is not ideal..


Thanks!

@abower-amd
Copy link
Collaborator

Thanks for this! I'm struggling to work out the exact sequence. I think we would benefit from at least trace_level 4 (if not higher) if this is reproducible. It looks like the PHC object has been available because we literally just detected the available clock comparison methods available and corrected the clock from epoch so it doesn't seem so straightforward as to what is wrong.

@sasharozenson
Copy link
Author

Unfortunately, this issue is quite sporadic, so capturing it with trace_level 4 might take a bit of time. I'll do my best, but I can’t promise a quick follow-up.

@abower-amd
Copy link
Collaborator

abower-amd commented Oct 25, 2024

Unfortunately, this issue is quite sporadic, so capturing it with trace_level 4 might take a bit of time. I'll do my best, but I can’t promise a quick follow-up.

Ah that's a shame - trace level 4 might not be very practical then outside of a deliberate reproduction attempt as it is very noisy - >=4 is used for trace that can recur and fill up logs...

@abower-amd
Copy link
Collaborator

Would it help to have runtime control of logging level so it can be boosted when you know a reflashing is imminent? Or would that be too fiddly to take advantage of?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants