sfptpd fails to recover on FPGA devices restart, leading to service crash #17
Comments
Thanks for this! I'm struggling to work out the exact sequence. I think we would benefit from at least |
Unfortunately, this issue is quite sporadic, so capturing it with |
Ah that's a shame - trace level 4 might not be very practical then outside of a deliberate reproduction attempt as it is very noisy - >=4 is used for trace that can recur and fill up logs...
Would it help to have runtime control of logging level so it can be boosted when you know a reflashing is imminent? Or would that be too fiddly to take advantage of?
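To make the runtime-control idea concrete, here is a rough sketch of one way a daemon could let an operator raise or lower its trace level without a restart. This is not sfptpd's actual mechanism: every `example_` name, the default level of 3, the 0 to 6 range, and the use of SIGUSR1/SIGUSR2 are assumptions for illustration, and a real daemon may already use those signals for something else.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical limits and default; not taken from sfptpd. */
#define EXAMPLE_TRACE_MIN 0
#define EXAMPLE_TRACE_MAX 6

/* sig_atomic_t so the handler can update it safely. */
static volatile sig_atomic_t example_trace_level = 3;

/* SIGUSR1 raises the level by one, SIGUSR2 lowers it, clamped to range. */
static void example_on_signal(int signo)
{
    if (signo == SIGUSR1 && example_trace_level < EXAMPLE_TRACE_MAX)
        example_trace_level++;
    else if (signo == SIGUSR2 && example_trace_level > EXAMPLE_TRACE_MIN)
        example_trace_level--;
}

/* Install both handlers; returns 0 on success, -1 on failure. */
int example_install_trace_signals(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = example_on_signal;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;
    return 0;
}

/* Logging code would compare a message's level against this value. */
int example_get_trace_level(void)
{
    return (int)example_trace_level;
}
```

With something like this in place, an operator could send the "raise" signal just before a planned reflash and the "lower" signal afterwards, so the noisy trace only covers the window of interest.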
Environment
The problem
During FPGA NIC flashing, sfptpd sometimes fails to reinitialize its PTP clock synchronization, resulting in a service crash. This happens sporadically, as sfptpd may recover successfully after some flashing sessions while failing in others.
Note: PTP configuration is being applied to a Solarflare device, not one of the restarted FPGA devices.
Expected behavior
sfptpd should detect NIC hotplug events, reinitialize the PHC, and resume clock synchronization after NIC flashing without requiring a manual restart.
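As an illustration of the "reinitialize the PHC" step, the sketch below shows a simple poll-style check for whether a PHC character device is currently present and readable. It is not how sfptpd detects hotplug events; `example_phc_is_usable` and `example_fd_to_clockid` are hypothetical helpers, and the device path is whatever the caller supplies. The fd-to-clock-id conversion follows the dynamic POSIX clock convention used on Linux.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/* Convert an open PHC file descriptor into a dynamic clock id
 * (Linux convention: bitwise-inverted fd shifted left by 3, low bits = 3). */
static clockid_t example_fd_to_clockid(int fd)
{
    return (clockid_t)((((unsigned int)~fd) << 3) | 3u);
}

/* Returns 1 if the device can be opened and its time read, 0 otherwise.
 * A caller could poll this after a removal event and only re-attach the
 * clock once it reports 1. */
int example_phc_is_usable(const char *dev_path)
{
    struct timespec ts;
    int fd;
    int ok;

    fd = open(dev_path, O_RDONLY);
    if (fd < 0)
        return 0;               /* device node absent or not accessible */

    ok = (clock_gettime(example_fd_to_clockid(fd), &ts) == 0);
    close(fd);
    return ok;
}
```

A daemon would typically react to netlink or udev notifications rather than poll, but a check of this kind is a cheap guard before touching a clock that may have disappeared.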
Reproduction steps
Logs
Observations
The failure is linked to sfptpd's handling of PHC device removals and re-insertions during device flashing. The assertion failure
"sfptpd_phc_record_step: Assertion 'phc != ((void *)0)' failed"
suggests sfptpd references an uninitialized or null PHC device upon attempting to adjust the clock.
Temporary Workaround
Restarting sfptpd after flashing typically resolves the issue, though this is not ideal.
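For reference, the kind of handling that would avoid the crash noted under Observations is to treat a missing PHC as a recoverable runtime condition rather than asserting on it. The sketch below is only a guess at the shape of such a guard: `example_clock`, `example_phc` and `example_record_step` are hypothetical stand-ins, not sfptpd's real types or its `sfptpd_phc_record_step` function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a PHC handle. */
struct example_phc {
    int steps_recorded;
};

/* Hypothetical clock object; phc is NULL while the device is removed. */
struct example_clock {
    const char *name;
    struct example_phc *phc;
};

/* Record a clock step. Instead of asserting that the PHC exists, report
 * failure so the caller can skip the adjustment and retry after the
 * device has been re-inserted. */
bool example_record_step(struct example_clock *clock)
{
    if (clock == NULL || clock->phc == NULL) {
        fprintf(stderr, "clock %s: no PHC attached, skipping step\n",
                (clock != NULL && clock->name != NULL) ? clock->name
                                                       : "(unknown)");
        return false;
    }

    clock->phc->steps_recorded++;
    return true;
}
```

Whether skipping the step is the right recovery depends on where the null PHC comes from, which is exactly what more detailed trace would help establish.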
Thanks!