[Bug]: Critical fault #12 flash filesystem corruption and format on nrf52 platform #5839
Comments
Do you use an Android phone to connect to your Heltec? There are a few other bugs that look similar to this one. |
Yes, I use an Android phone. |
Same Error on T-Echo. Logs attached. Use Android with 2.5.16 beta App. ERROR | ??:??:?? 4 Error: can't encode protobuf io error |
T-Echo DEBUG | 15:53:01 2493 [Router] Opening /prefs/db.proto, fullAtomic=0 |
I think I have the same problem (T114 v2). My device is in a boot loop and I can't get it to start at all. DEBUG | ??:??:?? 5 Expand short PSK #1 |
From the initial report:
And from trying to reproduce the issue last night (from a RAK WisBlock 4631):
Both have something Bluetooth-related happening around the same time as an update to the db.proto file. The Bluetooth library also writes to the file system: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/master/libraries/Bluefruit52Lib/src/utility/bonding.cpp Could there be some preemption happening that causes an LFS change to happen at the same time as another LFS change is happening?
Edit: Another question to investigate: for each bad block detected, does the usable size of the LFS filesystem decrease by a block? Meaning, after a while of this same thing happening, could it become impossible to write new files because the file system does not have enough free blocks?
Edit2: Seems like this would prevent simultaneous access from happening: adafruit/Adafruit_nRF52_Arduino#397
Edit3: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/Bluefruit52Lib/src/bluefruit.cpp#L711 The flash operations callback inside the bluefruit library?
Edit4: Curious why this semaphore allows up to 10 concurrent accesses instead of just 1? https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L110C12-L110C36
There is some good analysis in adafruit/Adafruit_nRF52_Arduino#350 & littlefs-project/littlefs#352. This seems to be what led to the current locking solution in Adafruit_LittleFS. |
I've been able to get some debug logs by modifying this code: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/cores/nRF5/common_func.h#L154-L158
And adding:
#if __cplusplus
extern "C" void logLegacy(const char *level, const char *fmt, ...);
#define PRINTF(...) logLegacy("DEBUG", __VA_ARGS__)
#else
void logLegacy(const char *level, const char *fmt, ...);
#define PRINTF(...) logLegacy("DEBUG", __VA_ARGS__)
#endif
I'm not easily able to reproduce the issue though. |
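For anyone who wants to try the same redirection off-device, here is a minimal sketch of what a `logLegacy` definition could look like. Only the declaration and the `PRINTF` macro come from the comment above; the body, the `g_last_log` buffer, and the "level | message" layout are assumptions made for illustration, not the firmware's actual logger.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstdio>
#include <string>

// Hypothetical capture buffer so the output can be inspected without hardware.
static std::string g_last_log;

// Stand-in logger (assumed body): formats the message as "<level> | <text>".
extern "C" void logLegacy(const char *level, const char *fmt, ...)
{
    char msg[128];
    va_list args;
    va_start(args, fmt);
    std::vsnprintf(msg, sizeof(msg), fmt, args); // truncates safely if the text is too long
    va_end(args);

    g_last_log = std::string(level) + " | " + msg;
    std::printf("%s\n", g_last_log.c_str());
}

// Same redirection as in the comment above: every PRINTF becomes a DEBUG log line.
#define PRINTF(...) logLegacy("DEBUG", __VA_ARGS__)
```

With this in place, a call such as `PRINTF("addr=0x%08X len=%d", 0xED000u, 128)` goes through the same path as any other debug line, so library messages end up interleaved with the application log.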
There are definitely some file accesses that happen when the phone connects.
|
@esev It seems like you've got a lot more knowledge than me here, and it seems like there's some good clues in the thread so far. I've been testing around this issue today, and as far as I can tell: it occurs when Bluetooth disconnects unexpectedly during a flash write. The reason for the disconnection seems relevant. A graceful disconnect, initiated by the phone, gives reason:
If this type of disconnection occurs during a flash write, there is no issue. Disconnection types that I've seen cause issues are:
I see One workaround for the
void NRF52Bluetooth::shutdown()
{
    // Shutdown bluetooth for minimum power draw
    LOG_INFO("Disable NRF52 bluetooth");
    uint8_t connection_num = Bluefruit.connected();
    if (connection_num) {
        for (uint8_t i = 0; i < connection_num; i++) {
            LOG_INFO("NRF52 bluetooth disconnecting handle %d", i);
            Bluefruit.disconnect(i);
        }
        // Wait for disconnection
        while (Bluefruit.connected())
            yield();
        LOG_INFO("All bluetooth connections ended");
    }
    Bluefruit.Advertising.stop();
}
I haven't found any workaround for
It may depend on the phone used for testing, but here is how I have been able to reproduce the issue:
|
Looks like we are likely going to need to make some modifications to Bluefruit's persistence. I would prefer it to pull more of those things into memory to cut down on file system IO if possible. That's a shoot from the hip answer though, until I dig in more.
@todd-herbert I think that workaround is worth a PR even if it's not a comprehensive fix. |
With the changes described in #5839 (comment), I've grabbed some more detailed logs of what's going on in the
|
That's very helpful @todd-herbert ! Knowing it happens when the phone goes out of range is helpful. Nothing in the bluefruit library accesses LFS at that point so we can potentially rule out "multiple fs access" as being the cause. This may be odd timeout behavior in the hardware. I'll look into this more after work tonight.
The microwave idea and the save on button press should make debugging much easier. |
Looking at the (very helpful) reference links you've collected, I think I might have a working solution, although possibly I'm just being naive.. I'm not sure what the full implications of it would be. It's late here now but I'll try to get a proof of concept pushed tonight so you can see the general idea and check if there's any merit in it. |
Here's the changes I threw together that appear to fix the issue: flash_nrf5x.c
I seem to now be able to trigger the problematic disconnect with no consequences. Not shown here, but with the additional logging setup you described earlier I was able to add extra debug output to confirm that, in the situation shown above, the flash operations do fail, and are successfully reattempted.
That is the big question.. I've replaced it with a binary semaphore with no immediately apparent consequences, but it's a bit over my head so I'm hoping someone more knowledgeable than me will know if there was some very good reason for that counting semaphore. |
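To make the "reattempt the flash operation" idea concrete for readers of the thread, here is a rough host-side sketch. It is not the code from flash_nrf5x.c: `FakeFlash`, `FlashStatus`, and `write_with_retry` are hypothetical names, and the simulated failures only stand in for a flash operation that gets interrupted by the problematic disconnect.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status codes standing in for the real flash operation results.
enum FlashStatus : uint8_t { FLASH_OK, FLASH_FAILED };

// Simulated flash controller: the first `fail_count` operations fail,
// every later one succeeds. `attempts` counts how often we tried.
struct FakeFlash {
    int fail_count;
    int attempts;

    FlashStatus write_page()
    {
        ++attempts;
        if (fail_count > 0) {
            --fail_count;
            return FLASH_FAILED;
        }
        return FLASH_OK;
    }
};

// Retry the operation up to max_attempts times.
// Returns true as soon as one attempt succeeds, false if all of them fail.
bool write_with_retry(FakeFlash &flash, int max_attempts)
{
    for (int i = 0; i < max_attempts; ++i) {
        if (flash.write_page() == FLASH_OK)
            return true;
        // On real hardware one would wait for the next flash operation
        // result here before trying again (assumption, not shown).
    }
    return false;
}
```

The bounded attempt count is a design choice in this sketch: it keeps a genuinely dead flash from turning a retry into an endless loop.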
Yes! That is the exact same thing (retries) I was going to try. Nice work! I'm so happy to see the issue go away with that change! :) I'll dig-in more to the semaphore. I think your change addresses the root issue. I'll give it a try tonight. |
Finally got a chance to test. This is working perfectly on my device. I can trigger the issue and I see the retry happen and succeed. No more lfs messages! Prior to testing your change, I also added some logging to show how many simultaneous callers passed through the original semaphore. It was only ever one at a time. That matches what I was expecting too, as I think the device can only perform one flash operation at a time and reports NRF_ERROR_BUSY when one is in progress. I like your change to use a binary semaphore. It seems much less surprising to me. |
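The "how many callers passed through the semaphore at once" logging could be done along these lines. This is only a sketch of the counting idea; `ConcurrencyProbe` is a hypothetical helper, not the logging that was actually added to the library.

```cpp
#include <atomic>
#include <cassert>

// Tracks how many callers are inside a guarded section right now,
// and the highest number ever seen at the same time.
struct ConcurrencyProbe {
    std::atomic<int> current{0};
    std::atomic<int> peak{0};

    // Call right after taking the semaphore.
    void enter()
    {
        int now = ++current;
        int seen = peak.load();
        // Raise `peak` to `now` unless another caller already recorded a higher value.
        while (now > seen && !peak.compare_exchange_weak(seen, now)) {
        }
    }

    // Call right before giving the semaphore back.
    void leave() { --current; }
};
```

If `peak` never goes above 1 during a long run, that supports the observation above that a binary semaphore is enough.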
One thing. I'm not sure it is necessary to https://github.com/littlefs-project/littlefs/blob/v1.7.2/DESIGN.md |
Very cool! I don't think I would have noticed that it was related to Bluetooth :) |
For keeping track of debugging ideas for the future:
|
That is reassuring to know that you had also independently reached the same conclusion about where we should go with this for a fix!
Ahh that would make sense if
That's a very good point. I'd restructured a bit today to cut back on the amount of duplicate code, and in the process had gotten rid of the assert, out of a general fear of unintentionally disrupting some fs behavior. Hopefully there's nothing similar lurking in today's changes too. I've opened it up as draft at meshtastic/Adafruit_nRF52_Arduino#1, and would very much appreciate your feedback if you spot anything we could do better. |
I just got another LFS_ASSERT crash. Not boot looping this time though. I tried using https://github.com/todd-herbert/meshtastic-nrf52-arduino/tree/reattempt-flash-ops to see if it would bring it back up, but it did not fix it. The only way I have been able to resuscitate it is to add FSCom.format() to the Arduino setup function. Then, once it has been reformatted on first boot, re-flash with the format line removed. |
When I was inducing that 0x22 BLE disconnect failure on shutdown, I would see it eventually end up in that same situation, after I had induced several shutdown-corruptions and the blocks had remapped several times. Maybe there's a clue there? Did you give the device a full erase recently, or has it had a history of these corruption events already? If it does lock up again, one idea might be to enable the extra logging as described by #5839 (comment), to see if there's any interesting info. I imagine you could do this even after it locks up. |
Yes, this device has had a history of this happening. This is the third time I've had to deal with it. The first time it was reboot looping, I think because the nodedb got too big, but I also left it in the house and might have had the BT distance thing happen. My attempts at a full erase didn't fix the issue, but I did try the full erase during the first and second occurrences. |
Different device, but adding more diagnostic data: this just happened to my Nano G2 a second time (previously a month ago). Clearing nodeDB fixed it on a hunch after a few days of "nodeDB-full" warnings on the very populated Boston mesh. Android Meshtastic showed critical fault 12, but this time I caught it before the boot loop. An early symptom is it forgetting the name I set, about two days before the error display and aggressive reconnect loop. After fault 12 is displayed and the reconnection loop starts, when I connect over USB to clear the nodeDB, the region is also UNSET. Clearing nodeDB un-wedges it. It almost feels like swap starvation or something, or a broken flash controller load-balancing algorithm, like the counterfeit SD cards that report 256 GB but really just overwrite preexisting blocks after they go over 2 GB. Interestingly, the random PIN was also the same PIN over and over for every reconnection attempt until I reset the nodeDB. |
Edit: This is described much better in #4447
|
I do remember seeing discussion about that over at #4447, so I think you're probably right on the money again. |
The reboot & format works. But I'm still seeing some FS corruption even with the retries. I'll keep digging into this.
|
Do you have any idea as to the source of the corruption? |
Not at all right now. :( |
does it still corrupt if you start with an empty nodeDB? |
@esev @todd-herbert I wonder if we should revisit the seek(0) and truncate approach and switch to more destructive delete, then open for write? We may have introduced a variable here which is muddying the water on the problem. |
Unfortunately, yes. I had been testing the format logic earlier in the day. So it did start from 0 at some point. |
I'll create a build for testing just prior to the changes to SafeFile, and cherry-pick in our retry/format changes too. One theory that I have is that maybe the flash writes are reporting success but a bit or two didn't actually get set correctly. IIUC the flash cache layer should still have a copy of the expected write. I'll add some logic to compare it to what was actually written. LittleFS does this comparison too. But I believe LittleFS is only configured to look at the change of a single 128 byte block. Whereas the flash writes cover an entire 4096 byte flash page. |
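A rough sketch of the read-back comparison described above, under the sizes mentioned in this comment (4096-byte flash page, 128-byte LittleFS block). The function name and the idea of reporting the first mismatching block are my own choices for illustration; the real check would compare the flash cache buffer against the memory-mapped flash page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kPageSize = 4096; // flash page size mentioned above
constexpr std::size_t kBlockSize = 128; // LittleFS block size mentioned above

// Compare the expected page contents against what was read back from flash,
// one 128-byte block at a time. Returns the index of the first block that
// differs, or -1 if the whole page matches.
int first_bad_block(const uint8_t *expected, const uint8_t *actual)
{
    for (std::size_t off = 0; off < kPageSize; off += kBlockSize) {
        if (std::memcmp(expected + off, actual + off, kBlockSize) != 0)
            return static_cast<int>(off / kBlockSize);
    }
    return -1;
}
```

Checking the whole page rather than only the block LittleFS just changed would catch a flipped bit in a neighbouring block that was rewritten as part of the same page erase and program cycle.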
I have seen a similar problem on a T-Echo. I've seen several failures on one of my T-Echos, obviously caused by NodeDB file corruption. During a failure, the log after reset looks like this:
I modified the code to trigger an assert intentionally in lfs_file_open. This results in the following log (two times device reset):
Again, on assert the device is frozen/locked. Next I commented the
Now, while the assert is triggered, we get the After setting up a JLink and debugging the code it seems for this exact problem the code is never going beyond _lockFS() in
So I modified the format function to use a finite delay:
So it seems my issue is with infinitely blocked _lockFS. [Edit] Removing the intentional assert but keeping the
|
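For readers following along, the "finite delay instead of blocking forever" idea can be sketched like this. It is not the actual patch: `FsLock` and `format_with_timeout` are hypothetical stand-ins, and on the device the lock would be an RTOS primitive rather than a spinning atomic flag.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the filesystem lock.
struct FsLock {
    std::atomic<bool> taken{false};

    // Try to take the lock, giving up once `timeout` has elapsed.
    bool lock_for(std::chrono::milliseconds timeout)
    {
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        while (taken.exchange(true, std::memory_order_acquire)) {
            if (std::chrono::steady_clock::now() >= deadline)
                return false; // somebody else still holds the lock
        }
        return true;
    }

    void unlock() { taken.store(false, std::memory_order_release); }
};

// Format only if the lock can be obtained in bounded time.
bool format_with_timeout(FsLock &lock, std::chrono::milliseconds timeout)
{
    if (!lock.lock_for(timeout))
        return false; // report failure instead of hanging forever
    // ... the real format call would go here ...
    lock.unlock();
    return true;
}
```

The point is only that a stuck lock now turns into a reported failure that the caller can handle, instead of a device frozen just after reset.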
Thanks @Mictronics for that very detailed analysis! Would you be able to test with #5900? |
Sure, let me pull this in and I will report back. [Edit]
|
And another real test of the new fix: My mobile T-Echo failed again. :-( Again during a time when a lot of new nodes were coming in. Device frozen in lockFS(), just after reset:
Installed right away the build with latest HEAD up to 8e8b22e.
So while the actual fix is obviously solving the lock-on-boot issue, the root cause of the NodeDB corruption issue still seems to exist. |
That's a good test. Thanks for that! meshtastic/Adafruit_nRF52_Arduino#1 goes a long way toward addressing the corruption issue. |
Oh wow, caused by BT? I guess I can set up a long-term log and play around here. I have two T-Echos on the bench that I use as handhelds. |
Seems to be. It's a really odd/unexpected interaction between the two.
Thank you so much. This would be really helpful. I've had mine crash a couple times still - but I can't rule out my own testing interfering with something. So far no issues in the last ~20 hours on my latest run. I also added a memcmp in meshtastic/Adafruit_nRF52_Arduino#1 in the loops where the retries are handled. If the memory doesn't read back correctly it also retries. I'm not 100% sure that is needed though - as I haven't hit any corruption since running a build with that extra logic in it. |
But there is a nuance here. I had four different T-Echos failing this week. Out of these four, two were set up in the wild, a third one in the backyard. These three, I am 100% sure, had and still have BT disabled. Only the one handheld, which failed again today, has BT enabled. |
This is a good data point. What firmware version were you running on them all? |
I just got another crash as well. And I hadn't been home, nor connected to the node, for 8+ hours when it crashed. None of the memcmp verification checks I added were ever triggered either. This still wasn't pre-SafeFile changes though. That's my next test and will start running that code tonight. |
Out in the wild something like 2.5.15 not sure. In the backyard 2.5.18 and the handheld one always the latest so 2.5.19 to 2.5.20. All custom build. |
Here are the logs covering my previous 3 crashes. [Edit: These logs were with my build including 85de193 & 1c0f43c] I'm now running a build without 85de193 & 1c0f43c to see if it avoids the issue.
|
My log is running. Let's see what we get. But my guess is a similar or equal error message. |
This is worth investigating: littlefs-project/littlefs#800 / littlefs-project/littlefs#268 |
Category
Other
Hardware
Heltec Mesh Node T114
Firmware Version
2.5.13.1a06f88
Description
Back in November 2024, my T114 showed a Critical fault #12. I rebooted it and it seemed to work OK, but a few days later, it got into a boot loop. The serial debug output was:
Then about a week later, another of my T114s also got into a boot loop. This one doesn't have a screen, so I don't know if it also had a Critical fault #12. I erased the flash, installed 2.5.13.1a06f88, connected the USB port to a PC, and logged all of the serial output, so in case it happened again, I could see what happened right before the first crash and reboot. After about a month, it rebooted:
And after the reboot, this is what it logged before rebooting again:
So it seems that what triggered it was the "Bad block" errors, and maybe the "relocation" code is buggy and corrupts the filesystem? In any case, the "Bad block" errors seem more relevant. From what I see in lfs.c, it looks like the "Bad block" message means a flash routine returned LFS_ERR_CORRUPT. However, I didn't see anything in InternalFileSystem.cpp that would return LFS_ERR_CORRUPT. So I think that means lfs_cache_cmp() returned false (e.g., line 194 of lfs.c).
I haven't looked into the details of how the caching works, but since I don't think the flash is going bad on either of my T114s (V2 hasn't been out that long), I wonder if something else in the firmware is corrupting the memory buffer used for the cache.
Relevant log output
No response