sfptpd fails to recover on FPGA devices restart, leading to service crash #17

sasharozenson · 2024-10-25T10:41:49Z

Environment

OS: Debian 11 ("Bullseye")
Kernel: 6.1.60-2
sfptpd Version: 3.7

The problem

During FPGA NIC flashing, sfptpd sometimes fails to reinitialize its PTP clock synchronization, resulting in a service crash. This happens sporadically, as sfptpd may recover successfully after some flashing sessions while failing in others.
Note: PTP configuration is being applied to a Solarflare device, not one of the restarted FPGA devices.

Expected behavior

sfptpd should detect NIC hotplug events, reinitialize the PHC, and resume clock synchronization after NIC flashing without requiring a manual restart.

Reproduction steps

Flash FPGA devices (~ 2 minutes).
Monitor sfptpd logs during and after the flashing process.
Check for clock-related errors or assertion failures.

Logs

Oct 24 11:30:01 hostname sfptpd[645550]: interface: handling detected changes: neta (if_index 1270)
Oct 24 11:30:01 hostname sfptpd[645550]: interface: handling detected changes: netb (if_index 1271)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: primnic (if_index 1272)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: bcknic (if_index 1273)
Oct 24 11:30:02 hostname sfptpd[645550]: clock phc3(neta/netb/primnic/bcknic/fnic1_2/fnic1_3): failed to adjust frequency using clock_adjtime(), No such device
Oct 24 11:30:02 hostname sfptpd[645550]: clock phc3(neta/netb/primnic/bcknic/fnic1_2/fnic1_3): error getting system time, No such device
Oct 24 11:30:02 hostname sfptpd[645550]: failed to read clock phc3(neta/netb/primnic/bcknic/fnic1_2/fnic1_3) time, No such device
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: fnic1_2 (if_index 1274)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: handling detected changes: fnic1_3 (if_index 1275)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: hotplug remove for fnic0_0 (1268)
Oct 24 11:30:02 hostname sfptpd[645550]: interface: hotplug remove for fnic0_1 (1269)
Oct 24 11:30:04 hostname sfptpd[645550]: no clock frequency correction file /var/lib/sfptpd/freq-correction-0000:0000:0000:0000
Oct 24 11:30:04 hostname sfptpd[645550]: clock (deleted): unable to adjust frequency - no phc device
Oct 24 11:30:04 hostname sfptpd[645550]: clock (deleted): unable to get time - no phc device
Oct 24 11:30:04 hostname sfptpd[645550]: failed to read clock (deleted) time, No such device
Oct 24 11:30:04 hostname sfptpd[645550]: phc3: using diff method SYS_OFFSET
Oct 24 11:30:04 hostname sfptpd[645550]: clock phc3: applying offset 1729745704.394490016 seconds
Oct 24 11:30:04 hostname sfptpd[645550]: **sfptpd: src/sfptpd_phc.c:1327: sfptpd_phc_record_step: Assertion `phc != ((void *)0)' failed.**
Oct 24 11:30:04 hostname systemd[1]: sfptpd.service: Main process exited, code=killed, status=6/ABRT
Oct 24 11:30:04 hostname systemd[1]: sfptpd.service: Failed with result 'signal'.

Observations

The failure is linked to sfptpd's handling of PHC device removals and re-insertions during device flashing. The assertion failure "sfptpd_phc_record_step: Assertion 'phc != ((void *)0)' failed" suggests sfptpd references an uninitialized or null PHC device upon attempting to adjust the clock.

Temporary Workaround

Restarting sfptpd after flashing typically resolves the issue, though this is not ideal..

Thanks!

The text was updated successfully, but these errors were encountered:

abower-amd · 2024-10-25T11:56:53Z

Thanks for this! I'm struggling to work out the exact sequence. I think we would benefit from at least trace_level 4 (if not higher) if this is reproducible. It looks like the PHC object has been available because we literally just detected the available clock comparison methods available and corrected the clock from epoch so it doesn't seem so straightforward as to what is wrong.

sasharozenson · 2024-10-25T13:32:55Z

Unfortunately, this issue is quite sporadic, so capturing it with trace_level 4 might take a bit of time. I'll do my best, but I can’t promise a quick follow-up.

abower-amd · 2024-10-25T13:37:01Z

Unfortunately, this issue is quite sporadic, so capturing it with trace_level 4 might take a bit of time. I'll do my best, but I can’t promise a quick follow-up.

Ah that's a shame - trace level 4 might not be very practical then outside of a deliberate reproduction attempt as it is very noisy - >=4 is used for trace that can recur and fill up logs...

abower-amd · 2024-11-01T13:10:42Z

Would it help to have runtime control of logging level so it can be boosted when you know a reflashing is imminent? Or would that be too fiddly to take advantage of?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sfptpd fails to recover on FPGA devices restart, leading to service crash #17

sfptpd fails to recover on FPGA devices restart, leading to service crash #17

sasharozenson commented Oct 25, 2024 •

edited

Loading

abower-amd commented Oct 25, 2024

sasharozenson commented Oct 25, 2024

abower-amd commented Oct 25, 2024 •

edited

Loading

abower-amd commented Nov 1, 2024

sfptpd fails to recover on FPGA devices restart, leading to service crash #17

sfptpd fails to recover on FPGA devices restart, leading to service crash #17

Comments

sasharozenson commented Oct 25, 2024 • edited Loading

Environment

The problem

Expected behavior

Reproduction steps

Logs

Observations

Temporary Workaround

abower-amd commented Oct 25, 2024

sasharozenson commented Oct 25, 2024

abower-amd commented Oct 25, 2024 • edited Loading

abower-amd commented Nov 1, 2024

sasharozenson commented Oct 25, 2024 •

edited

Loading

abower-amd commented Oct 25, 2024 •

edited

Loading