Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Critical fault #12 flash filesystem corruption and format on nrf52 platform #5839

Open
dahanc opened this issue Jan 13, 2025 · 58 comments
Labels
bug Something isn't working

Comments

@dahanc
Copy link
Contributor

dahanc commented Jan 13, 2025

Category

Other

Hardware

Heltec Mesh Node T114

Firmware Version

2.5.13.1a06f88

Description

Back in November 2024, my T114 showed a Critical fault #⁠12. I rebooted it and it seemed to work OK, but a few days later, it got into a boot loop. The serial debug output was:

INFO  | ??:??:?? 6 Adding node to database with 1 nodes and 147308 bytes free!
DEBUG | ??:??:?? 6 Expanding short PSK #1
INFO  | ??:??:?? 6 Wanted region 1, using US
INFO  | ??:??:?? 6 Saving /prefs/db.proto
lfs warn:314: No more free space 224
ERROR | ??:??:?? 6 Error: can't encode protobuf io error
ERROR | ??:??:?? 6 LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | ??:??:?? 6 LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | ??:??:?? 6 LFS assert: block < lfs->cfg

Then about a week later, another of my T114s also got into a boot loop. This one doesn't have a screen, so I don't know if it also had a Critical fault #⁠12. I erased the flash, installed 2.5.13.1a06f88, connected the USB port to a PC, and logged all of the serial output, so in case it happened again, I could see what happened right before the first crash and reboot. After about a month, it rebooted:

INFO  | 20:29:31 2159141 [Router] Received nodeinfo from=0xa20a11c8, id=0x7d3b35e6, portnum=4, payloadlen=44
INFO  | 20:29:31 2159141 [Router] Node database full with 80 nodes and 109096 bytes free. Erasing oldest entry
INFO  | 20:29:31 2159141 [Router] Adding node to database with 80 nodes and 109096 bytes free!
DEBUG | 20:29:31 2159141 [Router] old user /, channel=0
DEBUG | 20:29:31 2159141 [Router] Update changed=1 user MKTX G2 Gateway/MKG3, channel=0
DEBUG | 20:29:31 2159141 [Router] State: ON
DEBUG | 20:29:31 2159141 [Router] Node status update: 22 online, 80 total
INFO  | 20:29:31 2159141 [Router] Save /prefs/db.proto
lfs debug:617: Bad block at 126
lfs debug:617: Bad block at 180
INFO  | 20:29:36 2159146 [Router] BLE Disconnected, reason = 0x8
DEBUG | 20:29:36 2159146 [Router] PhoneAPI::close()
lfs debug:640: Relocating 126 160 to 181 160
ERROR | 20:29:37 2159147 [Router] LFS assert: head >= 2 && head <= lfs->cfg->block_count
ERROR | 20:29:37 2159147 [Rou

And after the reboot, these is what it logged before rebooting again:

INFO  | ??:??:?? 3 Adding node to database with 1 nodes and 153800 bytes free!
DEBUG | ??:??:?? 4 Expand short PSK #1
INFO  | ??:??:?? 4 Wanted region 1, using US
DEBUG | ??:??:?? 4 Coerce telemetry to min of 30 minutes on defaults
INFO  | ??:??:?? 4 Save /prefs/db.proto
ERROR | ??:??:?? 4 LFS assert: block < lfs->cfg->block_cou

So it seems that what triggered it was the "Bad block" errors, and maybe the "relocation" code is buggy and corrupts the filesystem? In any case, the "Bad block" errors seem more relevant. From what I see in lfs.c, it looks like the "Bad block" message means a flash routine returned LFS_ERR_CORRUPT. However, I didn't see anything in InternalFileSystem.cpp that would return LFS_ERR_CORRUPT. So I think that means lfs_cache_cmp() returned false (e.g., line 194 of lfs.c).

I haven't looked into the details of how the caching works, but since I don't think the flash is going bad on either of my T114s (V2 hasn't been out that long), I wonder if something else in the firmware is corrupting the memory buffer used for the cache.

Relevant log output

No response

@dahanc dahanc added the bug Something isn't working label Jan 13, 2025
@esev
Copy link
Contributor

esev commented Jan 13, 2025

Do you use an Android phone to connect to your Heltec? There are a few other bugs that look similar to this one.

@dahanc
Copy link
Contributor Author

dahanc commented Jan 13, 2025

Yes, I use an Android phone.

@daniel-hofer
Copy link

Same Error on T-Echo. Logs attached. Use Android with 2.5.16 beta App.

techolog.txt

ERROR | ??:??:?? 4 Error: can't encode protobuf io error
ERROR | ??:??:?? 4 Can't write prefs!
ERROR | ??:??:?? 4 Failed to save to disk, retrying
DEBUG | ??:??:?? 5 Opening /prefs/config.proto, fullAtomic=1
INFO | ??:??:?? 5 Save /prefs/config.proto
DEBUG | ??:??:?? 5 Opening /prefs/channels.proto, fullAtomic=1
INFO | ??:??:?? 5 Save /prefs/channels.proto
DEBUG | ??:??:?? 6 Opening /prefs/db.proto, fullAtomic=0
INFO | ??:??:?? 6 Save /prefs/db.proto
ERROR | ??:??:?? 6 NOTE! Record critical error 12 at src/mesh/NodeDB.cpp:1220

@liquidraver
Copy link

T-Echo
2.5.19 with ~10 nodes in the nodeDB
full wipe before install:

←[36m←[0m ←[34m DEBUG ←[0m| 15:53:01 2493 [Router] ←[34m Opening /prefs/db.proto, fullAtomic=0←[0m
←[37m←[0m ←[32m INFO ←[0m| 15:53:01 2493 [Router] ←[32m Save /prefs/db.proto←[0m
←[0m lfs warn:313: No more free space 160←[0m
←[0m lfs warn:313: No more free space 160←[0m
←[31m←[0m ←[31m ERROR ←[0m| 15:53:02 2494 [Router] ←[31m Error: can't encode protobuf io error←[0m
←[31m←[0m ←[31m ERROR ←[0m| 15:53:02 2494 [Router] ←[31m Can't write prefs!←[0m
←[31m←[0m ←[31m ERROR ←[0m| 15:53:02 2494 [Router] ←[31m Failed to save to disk, retrying←[0m
←[0m ←[34m lfs debug:617: Bad block at 1←[0m
←[0m lfs warn:625: Superblock 1 has become unwritable←[0m
←[0m ←[34m lfs debug:617: Bad block at 0←[0m
←[0m lfs warn:625: Superblock 0 has become unwritable←[0m
←[0m ←[31m lfs error:493: Corrupted dir pair at 0 1←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:05 2497 [Router] ←[34m Opening /prefs/db.proto, fullAtomic=0←[0m
←[37m←[0m ←[32m INFO ←[0m| 15:53:06 2498 [Router] ←[32m Save /prefs/db.proto←[0m
←[0m ←[34m lfs debug:1414: Bad block at 102617196←[0m
←[31m←[0m ←[31m ERROR ←[0m| 15:53:06 2498 [Router] ←[31m LFS assert: block < lfs->cfg->block_count←[0m
←[31m←[0m ←[31m ERROR ←[0m| 15:53:06 2498 [Router] ←[31m Trying FSCom.format()←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:06 2498 [Router] ←[34m Send module config: audio←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:11 2503 [Router] ←[34m Send module config: remote hardware←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:20 2512 [Router] ←[34m Send module config: neighbor info←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:22 2514 [Router] ←[34m Send module config: ambient lighting←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:24 2516 [Router] ←[34m Send module config: detection sensor←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:26 2518 [Router] ←[34m Send module config: paxcounter←[0m
←[36m←[0m ←[34m DEBUG ←[0m| 15:53:29 2521 [Router] ←[34m Send known nodes←[0m
←[37m←[0m ←[32m INFO ←[0m| 15:53:29 2521 [Router] ←[32m nodeinfo: num=0x75bd7aac, lastseen=0, id=!75bd7aac, name=Meshtastic 7aac←[0m
←[37m←[0m ←[32m INFO ←[0m| 15:53:35 2527 [Router] ←[32m BLE Disconnected, reason = 0x8←[0m
WARNING file:stream_interface.py __reader line:207 Meshtastic serial port disconnected, disconnecting... ClearCommError failed (PermissionError(13, 'The device does not recognize the command.', None, 22))

@Ztrby
Copy link

Ztrby commented Jan 14, 2025

I think i have the same problem (T114 v2), my device is in a boot loop and i can't get it to start at all

DEBUG | ??:??:?? 5 Expand short PSK #1
INFO | ??:??:?? 5 Wanted region 3, using EU_868
DEBUG | ??:??:?? 5 Coerce telemetry to min of 30 minutes on defaults
INFO | ??:??:?? 6 Save /prefs/db.proto
ERROR | ??:??:?? 6 LFS assert: block < lfs->cfg->block_count
ERROR | ??:??:?? 6 Trying FSCom.format()

@esev
Copy link
Contributor

esev commented Jan 15, 2025

From the initial report:

INFO  | 20:29:31 2159141 [Router] Save /prefs/db.proto
lfs debug:617: Bad block at 126
lfs debug:617: Bad block at 180
INFO  | 20:29:36 2159146 [Router] BLE Disconnected, reason = 0x8
DEBUG | 20:29:36 2159146 [Router] PhoneAPI::close()
lfs debug:640: Relocating 126 160 to 181 160

And from trying to reproduce the issue last night (from a RAK WisBlock 4631):

INFO  | 01:48:53 1293 [Router] Adding node to database with 80 nodes and 117812 bytes free!
DEBUG | 01:48:53 1293 [Router] Update changed=1 user North Fremont - Lakes&Birds Neigh/NFMT, id=0xe0f79ab8, channel=1
DEBUG | 01:48:53 1293 [Router] Node status update: 34 online, 80 total
lfs debug:617: Bad block at 94
lfs debug:617: Bad block at 178
INFO  | 01:48:55 1295 [Router] BLE Disconnected, reason = 0x8
DEBUG | 01:48:55 1295 [Router] PhoneAPI::close()
lfs debug:639: Relocating 94 93 to 179 93
INFO  | 01:48:56 1296 [Router] Save /prefs/db.proto

Both have something Bluetooth related happening around the same time as an update to the db.proto file.

The Bluetooth library also writes to the file system. https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/master/libraries/Bluefruit52Lib/src/utility/bonding.cpp

Could there be some preemption happening that causes a LFS change to happen at the same time as another LFS change is happening?

Edit: Another question to investigate: For each bad block detected, does the usable size of the LFS filesystem decrease by a block? Meaning, after a while of this same thing happening, could it become impossible to write new files due to the file system not having enough free blocks?

Edit2: Seems like this would prevent simultaneous access from happening: adafruit/Adafruit_nRF52_Arduino#397

Edit3: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/Bluefruit52Lib/src/bluefruit.cpp#L711 The flash operations callback inside the bluefruit library?

Edit4: Curious why this semaphore allows up to 10 concurrent accesses instead of just 1? https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L110C12-L110C36 There is some good analysis in adafruit/Adafruit_nRF52_Arduino#350 & littlefs-project/littlefs#352. This seems to be what lead to the current locking solution in Adafruit_LittleFS.

@esev
Copy link
Contributor

esev commented Jan 15, 2025

I've been able to get some debug logs by modifying this code: https://github.com/geeksville/Adafruit_nRF52_Arduino/blob/4f591d0f71f75e5128fab9dc42ac72f1696cf89f/cores/nRF5/common_func.h#L154-L158

And adding -DCFG_DEBUG=2 -DCFG_LOGGER=0 to build_flags

#if __cplusplus
  extern "C" void logLegacy(const char *level, const char *fmt, ...);
  #define PRINTF(...)    logLegacy("DEBUG", __VA_ARGS__)
#else
  void logLegacy(const char *level, const char *fmt, ...);
  #define PRINTF(...)    logLegacy("DEBUG", __VA_ARGS__)
#endif

I'm not easily able to reproduce the issue though.

@esev
Copy link
Contributor

esev commented Jan 15, 2025

There are definitely some file accesses that happen when the phone connects.

[BLE   ] BLE_GAP_EVT_CONNECTED : Conn Handle = 0#
[GAP   ] MAC = 69:26:0F:XX:XX:XX, Type = 2, Resolved = 0#
[GAP   ] Conn Interval = 48.75 ms, Latency = 0, Supervisor Timeout = 5000 ms#
[BLE   ] BLE_GATTC_EVT_CHAR_VAL_BY_UUID_READ_RSP : Conn Handle = 0#
INFO  | 05:28:49 236 BLE Connected to Pixel 8
[BLE   ] BLE_GAP_EVT_CONN_PARAM_UPDATE : Conn Handle = 0#
[GAP   ] Conn Interval= 7.50 ms, Latency = 0, Supervisor Timeout = 5000 ms#
[BLE   ] BLE_GAP_EVT_CONN_PARAM_UPDATE : Conn Handle = 0#
[GAP   ] Conn Interval= 30.00 ms, Latency = 0, Supervisor Timeout = 2000 ms#
[BLE   ] BLE_GAP_EVT_SEC_INFO_REQUEST : Conn Handle = 0#
[PAIR  ] Address: ID = 0, Type = 0x02, 93:0D:4B:XX:XX:XX#
[BOND  ] Loaded keys from file /adafruit/bond_prph/XXXXXXXXXXX#
[BLE   ] BLE_GAP_EVT_CONN_SEC_UPDATE : Conn Handle = 0#
[BOND  ] Loaded CCCD from file /adafruit/bond_prph/XXXXXXXXX ( offset = 90, len = 32 bytes )#
[PAIR  ] Security Mode = 1, Level = 3#
INFO  | 05:28:50 237 BLE connection secured
[BLE   ] BLE_GATTS_EVT_EXCHANGE_MTU_REQUEST : Conn Handle = 0#
[GAP   ] ATT MTU is changed to 247#
[BLE   ] BLE_GATTS_EVT_WRITE : Conn Handle = 0#
[GATTS ] attr's cccd = 0x0001#
[BOND  ] CCCD matches file /adafruit/bond_prph/XXXXXXXXXX contents, no need to write#
[BLE   ] BLE_GAP_EVT_DATA_LENGTH_UPDATE_REQUEST : Conn Handle = 0#
[GAP   ] Data Length Request is (tx, rx) octets = (251, 251), (tx, rx) time = (17040, 17040) us#
[BLE   ] BLE_GAP_EVT_DATA_LENGTH_UPDATE : Conn Handle = 0#
[GAP   ] Data Length is (tx, rx) octets = (251, 251), (tx, rx) time = (2120, 2120) us#
[BLE   ] BLE_GATTS_EVT_RW_AUTHORIZE_REQUEST : Conn Handle = 0#
[BLE   ] BLE_GATTS_EVT_WRITE : Conn Handle = 0#
[GATTS ] attr's value, uuid = 0x76D2, len = 2#

@todd-herbert
Copy link
Contributor

todd-herbert commented Jan 15, 2025

@esev It seems like you've got a lot more knowledge than me here, and it seems like there's some good clues in the thread so far. I've been testing around this issue today, and as far as I can tell: it occurs when Bluetooth disconnects unexpectedly during a flash write.

The reason for the disconnection seems relevant.

A graceful disconnect, initiated by the phone, gives reason:

0x13 BLE_HCI_REMOTE_USER_TERMINATED_CONNECTION

If this type of disconnection occurs during a flash write, there is no issue.

Disconnection types that I've seen cause issues are:

0x08 BLE_HCI_CONNECTION_TIMEOUT
0x22 BLE_HCI_STATUS_CODE_LMP_RESPONSE_TIMEOUT

I see 0x08 when the device moves out of range.
I see 0x22 when the device is disconnected on NRF52Bluetooth::shutdown.
If either of these disconnections occur during a flash write, lfs reports bad blocks, and corrupts the flash data.

One workaround for the0x22 seems to be to modify NRF52Bluetooth::shutdown, to poll for disconnection before continuing.
This doesn't get to the root of the issue, but does seems to reliably solve this one case.

void NRF52Bluetooth::shutdown()
{
    // Shutdown bluetooth for minimum power draw
    LOG_INFO("Disable NRF52 bluetooth");
    uint8_t connection_num = Bluefruit.connected();
    if (connection_num) {
        for (uint8_t i = 0; i < connection_num; i++) {
            LOG_INFO("NRF52 bluetooth disconnecting handle %d", i);
            Bluefruit.disconnect(i);
        }

        // Wait for disconnection
        while(Bluefruit.connected())
            yield();
        LOG_INFO("All bluetooth connections ended");
    }
    Bluefruit.Advertising.stop();
}

I haven't found any workaround for 0x08 disconnections so far. It would be nice if there is a true solution to be found somewhere deeper down in Bluefruit / the softdevice, but that's probably beyond my skill level.


I'm not easily able to reproduce the issue though.

It may depend on the phone used for testing, but here is how I have been able to reproduce the issue:

  • Heltec T114 (V1)
  • Android app 2.5.16
  • No existing config required (erased before flash)

0x22

0x08

  • Modify Fix NRF52 default append write mode of files #5858 firmware: add nodeDB->saveToDisk() to ButtonThread.cpp, so that user button pretty triggers flash write
  • Pair phone via Bluetooth
  • Place phone in unplugged microwave to shield RF (sophisticated.. I know)
  • Node powered by laptop, for portable logging.
  • Bring the node very near the closed microwave, allowing Bluetooth to automatically reconnect.
  • Move away from the microwave. A short distance (~3m seems to be enough in my case).
  • Before the Bluetooth connection is lost, press the user button, triggering the flash write. This seems to occur within several seconds, so it is fairly easy to time this without repeatedly pressing the button.

This is all a bit unorthodox, but using this method I seem to be able to repeatedly reproduce the logs from #5839 (comment)

@thebentern
Copy link
Contributor

There are definitely some file accesses that happen when the phone connects.

Looks like we are likely going to need to make some modifications to Bluefruit's persistence. I would prefer it to pull more of those things into memory to cut down on file system IO if possible. That's a shoot from the hip answer though, until I dig in more.

One workaround for the0x22 seems to be to modify NRF52Bluetooth::shutdown, to poll for disconnection before continuing.
This doesn't get to the root of the issue, but does seems to reliably solve this one case.

@todd-herbert I think that workaround is worth a PR even if it's not a comprehensive fix.

@todd-herbert
Copy link
Contributor

With the changes described in #5839 (comment), I've grabbed some more detailed logs of what's going on in the BLE_HCI_CONNECTION_TIMEOUT (0x08) scenario (disconnect out of range). Not sure if there's anything to be made of it, but here's the info:

DEBUG | 12:14:45 87 [Power] Battery: usbPower=0, isCharging=0, batMv=4159, batPct=97
DEBUG | 12:14:47 89 [Button] Opening /prefs/config.proto, fullAtomic=1
INFO  | 12:14:47 89 [Button] Save /prefs/config.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:617: Bad block at 30
[SOC   ] NRF_EVT_FLASH_OPERATION_ERROR#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:617: Bad block at 29
[BLE   ] BLE_GAP_EVT_DISCONNECTED : Conn Handle = 0#
[GAP   ] Disconnect Reason: CONNECTION_TIMEOUT #
INFO  | 12:14:49 91 [Button] BLE Disconnected, reason = 0x8
DEBUG | 12:14:49 91 [Button] PhoneAPI::close()
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:639: Relocating 30 5 to 35 5
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:2526: Found desync 35 5
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:49 92 [Button] Opening /prefs/module.proto, fullAtomic=1
INFO  | 12:14:49 92 [Button] Save /prefs/module.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:50 92 [Button] Opening /prefs/channels.proto, fullAtomic=1
INFO  | 12:14:50 92 [Button] Save /prefs/channels.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:50 93 [Button] Opening /prefs/db.proto, fullAtomic=0
INFO  | 12:14:50 93 [Button] Save /prefs/db.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
DEBUG | 12:14:51 93 [Button] State: ON
bool BLEAdvertising::_start(uint16_t, uint16_t): 377: verify failed, error = NRF_ERROR_INVALID_STATE#
DEBUG | 12:15:05 107 [Power] Battery: usbPower=0, isCharging=0, batMv=4148, batPct=97

@esev
Copy link
Contributor

esev commented Jan 15, 2025

That's very helpful @todd-herbert ! Knowing it happens when the phone goes out of range is helpful. Nothing in the bluefruit library accesses LFS at that point so we can potentially rule out "multiple fs access" as being the cause.

This may be odd timeout behavior in the hardware. I'll look into this more after work tonight.

The microwave idea and the save on button press should make debugging much easier.

@todd-herbert
Copy link
Contributor

todd-herbert commented Jan 15, 2025

Looking at the (very helpful) reference links you've collected, I think I might have a working solution, although possibly I'm just being naive.. I'm not sure what the full implications of it would be. It's late here now but I'll try to get a proof of concept pushed tonight so you can see the general idea and check if there's any merit in it.

@todd-herbert
Copy link
Contributor

todd-herbert commented Jan 15, 2025

Here's the changes I threw together that appear to fix the issue: flash_nrf5x.c

DEBUG | 16:17:19 43 [Button] Opening /prefs/config.proto, fullAtomic=1
INFO  | 16:17:19 43 [Button] Save /prefs/config.proto
DEBUG | 16:17:20 44 [Button] Opening /prefs/module.proto, fullAtomic=1
INFO  | 16:17:20 44 [Button] Save /prefs/module.proto
DEBUG | 16:17:20 44 [Button] Opening /prefs/channels.proto, fullAtomic=1
INFO  | 16:17:20 44 [Button] Save /prefs/channels.proto
INFO  | 16:17:25 49 [Button] BLE Disconnected, reason = 0x8
DEBUG | 16:17:25 49 [Button] PhoneAPI::close()
DEBUG | 16:17:25 49 [Button] Opening /prefs/db.proto, fullAtomic=0
INFO  | 16:17:25 49 [Button] Save /prefs/db.proto
DEBUG | 16:17:25 49 [Button] State: ON
DEBUG | 16:17:25 49 [Power] Battery: usbPower=0, isCharging=0, batMv=4172, batPct=98
INFO  | 16:17:25 49 [DeviceTelemetry] Send: air_util_tx=0.002611, channel_utilization=0.156667, battery_level=98, voltage=4.172000, uptime=49

I seem to now be able to trigger the problematic disconnect with no consequences. Not shown here, but with the additional logging setup you described earlier I was able to add extra debug output to confirm that, in the situation shown above, the flash operations do fail, and are successfully reattempted.

Edit4: Curious why this semaphore allows up to 10 concurrent accesses instead of just 1?

That is the big question.. I've replaced it with a binary semaphore with no immediately apparent consequences, but it's a bit over my head so I'm hoping someone more knowledgeable than me will know if there was some very good reason for that counting semaphore.

@esev
Copy link
Contributor

esev commented Jan 15, 2025

Yes! That is the exact same thing (retries) I was going to try. Nice work! I'm so happy to see the issue go away with that change! :)

I'll dig-in more to the semaphore. I'm kind of wondering if everything behaves the same if the max value is decreased to 1 (oh! you effectively did this too). Maybe only one is actually permitted at a time and the changes happen in series by waiting for err != NRF_ERROR_BUSY?

I think your change addresses the root issue. I'll give it a try tonight.

@esev
Copy link
Contributor

esev commented Jan 16, 2025

Finally got a chance to test. This is working perfectly on my device. I can trigger the issue and I see the retry happen and succeed. No more lfs messages!

Prior to testing your change, I also added some logging to show how many simultaneous callers passed through the original semaphore. It was only ever one at a time. That matches what I was expecting too, as I think the device can only perform one flash operation at a time and reports NRF_ERROR_BUSY when one is in progress. I like your change to use a binary semaphore. It seems much less surprising to me.

@esev
Copy link
Contributor

esev commented Jan 16, 2025

One thing. I'm not sure it is necessary to assert inside the flash operations. In theory LittleFS should detect the failure and relocate the failing blocks. It might be okay to give it a chance to do that.

https://github.com/littlefs-project/littlefs/blob/v1.7.2/DESIGN.md

@dahanc
Copy link
Contributor Author

dahanc commented Jan 16, 2025

Very cool! I don't think I would have noticed that it was related to Bluetooth :)

@esev
Copy link
Contributor

esev commented Jan 16, 2025

For keeping track of debugging ideas for the future:

  • I ended up adding a switch (c) statement inside this if statement:
    if (c != START1)
    rxPtr = 0; // failed to find framing
    I used the 'c' character over serial to trigger nodeDB->saveToDisk(). That could be easily extended to trigger other events when debugging as well.
  • A faraday bag (https://www.amazon.com/gp/product/B0BYM9VJMR) works well to trigger the Bluetooth timeout.

@todd-herbert
Copy link
Contributor

Yes! That is the exact same thing (retries) I was going to try

That is reassuring to know that you had also independently reached the same conclusion about where we should go with this for a fix!

Maybe only one is actually permitted at a time and the changes happen in series by waiting for err != NRF_ERROR_BUSY?

Prior to testing your change, I also added some logging to show how many simultaneous callers passed through the original semaphore. It was only ever one at a time. That matches what I was expecting too, as I think the device can only perform one flash operation at a time and reports NRF_ERROR_BUSY when one is in progress.

Ahh that would make sense if NRF_ERROR_BUSY blocks it that way. For what it's worth, I didn't see any instances of NRF_ERROR_BUSY in the wild while testing the proto saving, but I imagine it's totally possible that the Meshtastic firmware just doesn't trigger that sort of a simultaneous write situation.

One thing. I'm not sure it is necessary to assert inside the flash operations. In theory LittleFS should detect the failure and relocate the failing blocks. It might be okay to give it a chance to do that.

That's a very good point. I'd restructured a bit today to cut back on the amount of duplicate code, and in the process had gotten rid of the assert, out of a general fear of unintentionally disrupting some fs behavior. Hopefully there's nothing similar lurking in today's changes too. I've opened it up as draft at meshtastic/Adafruit_nRF52_Arduino#1, and would very much appreciate your feedback if you spot anything we could do better.

@tavdog
Copy link
Contributor

tavdog commented Jan 17, 2025

I just got a another LFS_ASSERT crash. Not boot looping this time though. I tried using https://github.com/todd-herbert/meshtastic-nrf52-arduino/tree/reattempt-flash-ops to see if it would bring it back up but it did not fix it. The only way I have been able to rescusitate it i to add FSCom.format() in to the arduino setup function. Then once it has been reformatted on first boot, re-flash with the format line removed.

@todd-herbert
Copy link
Contributor

When I was inducing that 0x22 ble disconnect failure on shutdown, I would see it eventually end up in that same situation, after I had induced seveal shutdown-corruptions, and the blocks had remapped several times. Maybe there's a clue there?

Did you give the device a full erase recently, or has it had a history of these corruption events already?

If it does lock up again, one idea might be to enable the extra logging as described by #5839 (comment), to see if there's any interesting info. I imagine you could do this even after it locks up.

@tavdog
Copy link
Contributor

tavdog commented Jan 17, 2025

When I was inducing that 0x22 ble disconnect failure on shutdown, I would see it eventually end up in that same situation, after I had induced seveal shutdown-corruptions, and the blocks had remapped several times. Maybe there's a clue there?

Did you give the device a full erase recently, or has it had a history of these corruption events already?

If it does lock up again, one idea might be to enable the extra logging as described by #5839 (comment), to see if there's any interesting info. I imagine you could do this even after it locks up.

Yes my this device has had a history of this happening. This is the third time I've had to deal with it. First time it was reboot looping because I think the nodedb got too big but I also left it in the house and might have had a the BT distance thing happen. My attempts at full erase didn't the fix the issue but I did try the full erase during the first and second occurences.

@NireBryce
Copy link

NireBryce commented Jan 18, 2025

Different device but adding more diagnostic data, this just happened to my Nano G2 a second time (previously a month ago), clearing nodeDB fixed it on a hunch after a few days of "nodeDB-full" warnings on the very populated Boston mesh.

Android meshtastic

critical fault 12, but this time I caught it before the boot-loop.

early symptom is it forgetting the name I set, about two days before the error display and aggressive reconnect loop.

after fault 12 is displayed and the reconnection loop starts, when I connect over USB to clear the nodeDB the region is also UNSET. clearing nodeDB un-wedges.

it almost feels like swap starvation or something. or a broken flash controller load balancing algorithm, like the counterfeit SDcards that report 256gb but really they just overwrite preexisting blocks after they go over 2gb

interestingly, the random pin also was the same pin over and over for every reconnection attempt until I reset the nodeDB

@esev
Copy link
Contributor

esev commented Jan 18, 2025

Edit: This is described much better in #4447

I've been digging a little deeper into the InternalFileSystem implementation. I'm not sure it is setup to work best with LittleFS. Normally LittleFS should be able to detect a failed flash erase/write and recover from that. But I think the InternalFileSystem implementation breaks that assumption.

Reason: The LittleFS documentation specifies that block_size should match the size of an erasable block. That is so LittleFS can read the block contents and compare them with its cache after writing to ensure they were written correctly. InternalFileSystem configures the LittleFS block_size to 128 bytes, while the smallest flash erasable size is actually 4096 bytes. So when/if a flash erase/write fails 4096 bytes of the filesystem could be lost, but LittleFS is only able to detect and recover 128 bytes from that block.

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/4dcfa3b7960a6dafec6736a5dbea255b0946be83/libraries/InternalFileSytem/src/InternalFileSystem.cpp#L35

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/4dcfa3b7960a6dafec6736a5dbea255b0946be83/libraries/InternalFileSytem/src/InternalFileSystem.cpp#L109-L113

https://github.com/littlefs-project/littlefs/blob/7e110b44c0e796dc56e2fe86587762d685653029/lfs.h#L146-L150

Am I reading this correctly?

I think retrying the flash writes (meshtastic/Adafruit_nRF52_Arduino#1) is going to help a lot here. But if there is an actual bad flash page, I think LittleFS may never be able to relocate around the issue due to the way it is configured in InternalFileSystem. It may also make the wear-leveling algorithm in LittleFS less effective as LittleFS is unaware of the true physical blocks that are being erased.

Note: I'm not sure there is enough storage on the NRF52 devices to make block_size match the actual flash block size. So this may explain why it is configured this way. But I think this configuration breaks some of the assumptions that LittleFS depends upon for detecting and recovering from bad flash pages.

@todd-herbert
Copy link
Contributor

I do remember seeing discussion about that over at #4447, so I think you're probably right on the money again.

@esev
Copy link
Contributor

esev commented Jan 21, 2025

The reboot & format works. But I'm still seeing some FS corruption even with the retries. I'll keep digging into this.

INFO  | 02:36:30 35671 [Router] Node database full with 80 nodes and 124884 bytes free. Erasing oldest entry
INFO  | 02:36:30 35671 [Router] Adding node to database with 80 nodes and 124884 bytes free!
DEBUG | 02:36:30 35671 [Router] Update changed=1 user Tekify Fiber & Wireless/TEK, id=0x384a22ab, channel=0
DEBUG | 02:36:30 35671 [Router] Node status update: 49 online, 80 total
DEBUG | 02:36:30 35671 [Router] Opening /prefs/db.proto, fullAtomic=0
INFO  | 02:36:30 35671 [Router] Save /prefs/db.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:19: Assert lfs_cache_read 68842597
ERROR | 02:36:31 35672 [Router] LittleFS corruption detected: block < lfs->cfg->block_count
INFO  | 02:36:31 35672 [Router] Rebooting to format LittleFS
INFO  | 02:36:31 35672 [Router] NRF_POWER=0x40000000
<94><C3>^B@^AINFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 1 LittleFS format complete; restoring default settings
DEBUG | ??:??:?? 1 Filesystem files:

@todd-herbert
Copy link
Contributor

Do you have any idea as to the source of the corruption?

@esev
Copy link
Contributor

esev commented Jan 21, 2025

Not at all right now. :(

@NireBryce
Copy link

INFO | 02:36:30 35671 [Router] Node database full with 80 nodes and 124884 bytes free. Erasing oldest entry
INFO | 02:36:30 35671 [Router] Adding node to database with 80 nodes and 124884 bytes free!

does it still corrupt if you start with an empty nodeDB?

@thebentern
Copy link
Contributor

thebentern commented Jan 21, 2025

@esev @todd-herbert I wonder if we should revisit the seek(0) and truncate approach and switch to more destructive delete, then open for write? We may have introduced a variable here which is muddying the water on the problem.

@esev
Copy link
Contributor

esev commented Jan 21, 2025

INFO | 02:36:30 35671 [Router] Node database full with 80 nodes and 124884 bytes free. Erasing oldest entry
INFO | 02:36:30 35671 [Router] Adding node to database with 80 nodes and 124884 bytes free!

does it still corrupt if you start with an empty nodeDB?

Unfortunately, yes. I had been testing the format logic earlier in the day. So it did start from 0 at some point.

@esev
Copy link
Contributor

esev commented Jan 21, 2025

@esev @todd-herbert I wonder if we should revisit the seek(0) and truncate approach and switch to more destructive delete, then open for write? We may have introduced a variable here which is muddying the water on the problem.

I'll create a build for testing just prior to the changes to SafeFile, and cherry-pick in our retry/format changes too.

One theory that I have is that maybe the flash writes are reporting success but a bit or two didn't actually get set correctly. IIUC the flash cache layer should still have a copy of the expected write. I'll add some logic to compare it to what was actually written.

LittleFS does this comparison too. But I believe LittleFS is only configured to look at the change of a single 128 byte block. Whereas the flash writes cover an entire 4096 byte flash page.

@caveman99 caveman99 changed the title [Bug]: Critical fault #12 (flash filesystem corruption) on Heltec T114 v2 [Bug]: Critical fault #12 flash filesystem corruption and format on nrf52 platform Jan 22, 2025
@caveman99 caveman99 marked this as a duplicate of #5909 Jan 22, 2025
@Mictronics
Copy link
Contributor

Mictronics commented Jan 22, 2025

I have seen a similar problem on a T-Echo. Seen several fails on one of my T-Echos cause obviously by NodeDB file corruption.

During fail the log after reset is like this:

INFO | ??:??:?? 0

//\ E S H T /\ S T / C

DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0  adafruit (directory)
DEBUG | ??:??:?? 0  bond_prph (directory)
DEBUG | ??:??:?? 0  bond_cntr (directory)
DEBUG | ??:??:?? 0  prefs (directory)
DEBUG | ??:??:?? 0  module.proto (113 Bytes)
DEBUG | ??:??:?? 0  channels.proto (102 Bytes)
DEBUG | ??:??:?? 0  config.proto (230 Bytes)
DEBUG | ??:??:?? 0  db.proto (8776 Bytes)
DEBUG | ??:??:?? 0  meshtastic.txt (342 Bytes)
DEBUG | ??:??:?? 0 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 0 Use analog input 4 for battery level
INFO  | ??:??:?? 0 Scan for i2c devices
DEBUG | ??:??:?? 0 Scan for I2C devices on port 1
INFO  | ??:??:?? 0 PCF8563 found at address 0x51
INFO  | ??:??:?? 0 BME280 found at address 0x77
INFO  | ??:??:?? 0 2 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:7,2.5.20.22544c18c
DEBUG | ??:??:?? 0 Reset reason: 0x0
DEBUG | ??:??:?? 0 Set random seed 3136426161
INFO  | ??:??:?? 0 Init NodeDB
INFO  | ??:??:?? 0 Load /prefs/db.proto
ERROR | ??:??:?? 0 LFS assert: head >= 2 && head <= lfs->cfg->block_count

I modified the code to trigger an assert intentionally in lfs_file_open. This results in the following log (two times device reset):

��@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0  prefs (directory)
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:??��@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0  prefs (directory)
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:??��@INFO  | ??:??:?? 0 

Again, on assert the device is frozen/locked .

Next I commented the FSCom.format()in the lfs_assert function, resulting in the following log:

//\ E S H T /\ S T / C

DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0  prefs (directory)
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:?? 0 Trying FSCom.format()
DEBUG | ??:??:?? 0    config.proto (272 Bytes)
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:?? 0 Trying FSCom.format()
DEBUG | ??:??:?? 0    channels.proto (104 Bytes)
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:?? 0 Trying FSCom.format()
DEBUG | ??:??:?? 0    db.proto (5938 Bytes)
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:?? 0 Trying FSCom.format()
DEBUG | ??:??:?? 0    module.proto (121 Bytes)
DEBUG | ??:??:?? 0  adafruit (directory)
DEBUG | ??:??:?? 0  bond_prph (directory)
DEBUG | ??:??:?? 0  bond_cntr (directory)
DEBUG | ??:??:?? 0 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 0 Use analog input 4 for battery level
INFO  | ??:??:?? 0 Scan for i2c devices
DEBUG | ??:??:?? 0 Scan for I2C devices on port 1
INFO  | ??:??:?? 0 PCF8563 found at address 0x51
INFO  | ??:??:?? 0 BME280 found at address 0x77
INFO  | ??:??:?? 0 2 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:7,2.5.20.22544c18c
DEBUG | ??:??:?? 0 Reset reason: 0x0
DEBUG | ??:??:?? 0 Set random seed 2181716872
INFO  | ??:??:?? 0 Init NodeDB
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:?? 0 Trying FSCom.format()
INFO  | ??:??:?? 0 Load /prefs/db.proto
INFO  | ??:??:?? 0 Loaded /prefs/db.proto successfully
INFO  | ??:??:?? 0 Loaded saved devicestate version 23, with nodecount: 71
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:?? 0 Trying FSCom.format()
INFO  | ??:??:?? 0 Load /prefs/config.proto
INFO  | ??:??:?? 0 Loaded /prefs/config.proto successfully
INFO  | ??:??:?? 0 Loaded saved config version 23
ERROR | ??:??:?? 0 LFS assert: false
ERROR | ??:??:?? 0 Trying FSCom.format()
INFO  | ??:??:?? 0 Load /prefs/module.proto
INFO  | ??:??:?? 0 Loaded /prefs/module.proto successfully
INFO  | ??:??:?? 0 Loaded saved moduleConfig version 23
ERROR | ??:??:?? 0 LFS assert: false

Now, while the assert is triggered, we get the Trying FSCom.format() log message and the code continues.

After setting up a JLink and debugging the code it seems for this exact problem the code is never going beyond _lockFS() in

bool Adafruit_LittleFS::format (void)
{
  _lockFS();

  int err = LFS_ERR_OK;
  bool attemptMount = _mounted;

_lockFS expands to xSemaphoreTake(_mutex, portMAX_DELAY); and time of portMAX_DELAY can be used to block indefinitely.

So I modfied the format function to use a finite delay: xSemaphoreTake(_mutex, portTICK_PERIOD_MS * 100);resulting in:


//\ E S H T /\ S T / C

DEBUG | ??:??:?? 4 Filesystem files:
DEBUG | ??:??:?? 4  adafruit (directory)
DEBUG | ??:??:?? 4  bond_prph (directory)
DEBUG | ??:??:?? 4  bond_cntr (directory)
DEBUG | ??:??:?? 4  prefs (directory)
DEBUG | ??:??:?? 4 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 4 Use analog input 4 for battery level
INFO  | ??:??:?? 4 Scan for i2c devices
DEBUG | ??:??:?? 4 Scan for I2C devices on port 1
INFO  | ??:??:?? 4 PCF8563 found at address 0x51
INFO  | ??:??:?? 4 BME280 found at address 0x77
INFO  | ??:??:?? 4 2 I2C devices found
DEBUG | ??:??:?? 4 acc_info = 0
INFO  | ??:??:?? 4 S:B:7,2.5.20.22544c18c
DEBUG | ??:??:?? 4 Reset reason: 0x0
DEBUG | ??:??:?? 4 Set random seed 138776071
INFO  | ??:??:?? 4 Init NodeDB
ERROR | ??:??:?? 4 Could not open / read /prefs/db.proto
WARN  | ??:??:?? 4 Devicestate 0 is old, discard
INFO  | ??:??:?? 4 Install default DeviceState
DEBUG | ??:??:?? 4 Initial packet id 440999420
DEBUG | ??:??:?? 4 Partially randomized packet id 1067092477
DEBUG | ??:??:?? 4 Use nodenum 0xe543e4cf 
ERROR | ??:??:?? 4 Could not open / read /prefs/config.proto
INFO  | ??:??:?? 4 Install default LocalConfig
INFO  | ??:??:?? 4 Set default channel and radio preferences!
INFO  | ??:??:?? 4 Wanted region 3, using EU_868
ERROR | ??:??:?? 4 Could not open / read /prefs/module.proto
INFO  | ??:??:?? 4 Install default ModuleConfig
ERROR | ??:??:?? 4 Could not open / read /prefs/channels.proto
INFO  | ??:??:?? 4 Install default ChannelFile
ERROR | ??:??:?? 4 Could not open / read /prefs/uiconfig.proto
DEBUG | ??:??:?? 4 cleanupMeshDB purged 0 entries
DEBUG | ??:??:?? 4 Use nodenum 0xe543e4cf 
INFO  | ??:??:?? 4 Adding node to database with 1 nodes and 164212 bytes free!
INFO  | ??:??:?? 4 Generate new PKI keys
DEBUG | ??:??:?? 4 Generate Curve25519 keypair
INFO  | ??:??:?? 4 Set default channel and radio preferences!
INFO  | ??:??:?? 4 Wanted region 3, using EU_868
DEBUG | ??:??:?? 4 Opening /prefs/config.proto, fullAtomic=1
ERROR | ??:??:?? 4 LFS assert: false
ERROR | ??:??:?? 4 Trying FSCom.format()
INFO  | ??:??:?? 10 Save /prefs/config.proto
ERROR | ??:??:?? 10 Error: can't encode protobuf io error
ERROR | ??:??:?? 10 Can't write prefs!
DEBUG | ??:??:?? 11 Opening /prefs/channels.proto, fullAtomic=1
ERROR | ??:??:?? 11 LFS assert: false
ERROR | ??:??:?? 11 Trying FSCom.format()
INFO  | ??:??:?? 14 Save /prefs/channels.proto
ERROR | ??:??:?? 14 Error: can't encode protobuf io error
ERROR | ??:??:?? 14 Can't write prefs!
DEBUG | ??:??:?? 14 Opening /prefs/db.proto, fullAtomic=0
ERROR | ??:??:?? 14 LFS assert: false
ERROR | ??:??:?? 14 Trying FSCom.format()
INFO  | ??:??:?? 14 Save /prefs/db.proto
ERROR | ??:??:?? 14 Error: can't encode protobuf io error
ERROR | ??:??:?? 14 Can't write prefs!
ERROR | ??:??:?? 14 Failed to save to disk, retrying
DEBUG | ??:??:?? 15 Opening /prefs/config.proto, fullAtomic=1
ERROR | ??:??:?? 15 LFS assert: false
ERROR | ??:??:?? 15 Trying FSCom.format()
INFO  | ??:??:?? 15 Save /prefs/config.proto
ERROR | ??:??:?? 15 Error: can't encode protobuf io error
ERROR | ??:??:?? 15 Can't write prefs!
DEBUG | ??:??:?? 15 Opening /prefs/channels.proto, fullAtomic=1
ERROR | ??:??:?? 15 LFS assert: false
ERROR | ??:??:?? 15 Trying FSCom.format()
INFO  | ??:??:?? 16 Save /prefs/channels.proto
ERROR | ??:??:?? 16 Error: can't encode protobuf io error
ERROR | ??:??:?? 16 Can't write prefs!
DEBUG | ??:??:?? 16 Opening /prefs/db.proto, fullAtomic=0
ERROR | ??:??:?? 16 LFS assert: false
ERROR | ??:??:?? 16 Trying FSCom.format()
INFO  | ??:??:?? 16 Save /prefs/db.proto
ERROR | ??:??:?? 16 Error: can't encode protobuf io error
ERROR | ??:??:?? 16 Can't write prefs!
ERROR | ??:??:?? 16 NOTE! Record critical error 13 at src/mesh/NodeDB.cpp:1242

So it seems my issue is with infinitely blocked _lockFS.

[Edit]
I think that makes sense somehow. When we assert out of another function with file system access then the mutex is already taken by a previous _lockFS().

Removing the intentional assert but keeping the xSemaphoreTake(_mutex, portTICK_PERIOD_MS * 100); results in FS format and default configuration:

//\ E S H T /\ S T / C

DEBUG | ??:??:?? 4 Filesystem files:
DEBUG | ??:??:?? 4 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 4 Use analog input 4 for battery level
INFO  | ??:??:?? 4 Scan for i2c devices
DEBUG | ??:??:?? 4 Scan for I2C devices on port 1
INFO  | ??:??:?? 4 PCF8563 found at address 0x51
INFO  | ??:??:?? 4 BME280 found at address 0x77
INFO  | ??:??:?? 4 2 I2C devices found
DEBUG | ??:??:?? 4 acc_info = 0
INFO  | ??:??:?? 4 S:B:7,2.5.20.22544c18c
DEBUG | ??:??:?? 4 Reset reason: 0x0
DEBUG | ??:??:?? 4 Set random seed 3211863722
INFO  | ??:??:?? 4 Init NodeDB
ERROR | ??:??:?? 4 Could not open / read /prefs/db.proto
WARN  | ??:??:?? 4 Devicestate 0 is old, discard
INFO  | ??:??:?? 4 Install default DeviceState
DEBUG | ??:??:?? 4 Initial packet id 1691170642
DEBUG | ??:??:?? 4 Partially randomized packet id 2298260307
DEBUG | ??:??:?? 4 Use nodenum 0xe543e4cf 
ERROR | ??:??:?? 4 Could not open / read /prefs/config.proto
INFO  | ??:??:?? 4 Install default LocalConfig
INFO  | ??:??:?? 4 Set default channel and radio preferences!
INFO  | ??:??:?? 4 Wanted region 3, using EU_868
ERROR | ??:??:?? 4 Could not open / read /prefs/module.proto
INFO  | ??:??:?? 4 Install default ModuleConfig
ERROR | ??:??:?? 4 Could not open / read /prefs/channels.proto
INFO  | ??:??:?? 4 Install default ChannelFile
ERROR | ??:??:?? 4 Could not open / read /prefs/uiconfig.proto
DEBUG | ??:??:?? 4 cleanupMeshDB purged 0 entries

@esev
Copy link
Contributor

esev commented Jan 22, 2025

Thanks @Mictronics for that very detailed analysis!

Would you be able to test with #5900?

@Mictronics
Copy link
Contributor

Mictronics commented Jan 22, 2025

Sure, let me pull this in and I will report back.

[Edit]
Initial boot then pressing a in the console. (on T-Echo)

��@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0  prefs (directory)
DEBUG | ??:??:?? 0    config.proto (203 Bytes)
DEBUG | ??:??:?? 0    channels.proto (133 Bytes)
DEBUG | ??:??:?? 0    db.proto (564 Bytes)
DEBUG | ??:??:?? 0  adafruit (directory)
DEBUG | ??:??:?? 0  bond_prph (directory)
DEBUG | ??:??:?? 0  bond_cntr (directory)
DEBUG | ??:??:?? 0 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 0 Use analog input 4 for battery level
INFO  | ??:??:?? 0 Scan for i2c devices
DEBUG | ??:??:?? 0 Scan for I2C devices on port 1
INFO  | ??:??:?? 0 PCF8563 found at address 0x51
INFO  | ??:??:?? 0 BME280 found at address 0x77
INFO  | ??:??:?? 0 2 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:7,2.5.20.b8592f886
DEBUG | ??:??:?? 0 Reset reason: 0x0
DEBUG | ??:??:?? 0 Set random seed 1402567225
INFO  | ??:??:?? 0 Init NodeDB
INFO  | ??:??:?? 0 Load /prefs/db.proto
INFO  | ??:??:?? 0 Loaded /prefs/db.proto successfully
INFO  | ??:??:?? 0 Loaded saved devicestate version 23, with nodecount: 5
INFO  | ??:??:?? 0 Load /prefs/config.proto
INFO  | ??:??:?? 0 Loaded /prefs/config.proto successfully
INFO  | ??:??:?? 0 Loaded saved config version 23
ERROR | ??:??:?? 0 Could not open / read /prefs/module.proto
INFO  | ??:??:?? 0 Install default ModuleConfig
INFO  | ??:??:?? 0 Load /prefs/channels.proto
INFO  | ??:??:?? 0 Loaded /prefs/channels.proto successfully
INFO  | ??:??:?? 0 Loaded saved channelFile version 23
DEBUG | ??:??:?? 0 cleanupMeshDB purged 3 entries
DEBUG | ??:??:?? 0 Use nodenum 0xe543e4cf 
:
:
:
DEBUG | 19:45:51 9 [RadioIf] Started Tx (id=0xb4bc655f fr=0xe7159523 to=0xffffffff, WantAck=0, HopLim=5 Ch=0xea encrypted len=39 rxSNR=6.25 rxRSSI=-46 hopStart=7 priority=64)
DEBUG | 19:45:51 9 [RadioIf] Packet TX: 526ms
DEBUG | 19:45:51 9 [RadioIf] 1 packets remain in the TX queue
DEBUG | 19:45:51 9 [RadioIf] Completed sending (id=0xb4bc655f fr=0xe7159523 to=0xffffffff, WantAck=0, HopLim=5 Ch=0xea encrypted len=39 rxSNR=6.25 rxRSSI=-46 hopStart=7 priority=64)
DEBUG | 19:45:51 9 [RadioIf] Started Tx (id=0xd903bc2e fr=0xe543e4cf to=0xe7159523, WantAck=0, HopLim=3 Ch=0xea encrypted len=102 hopStart=3 priority=10)
DEBUG | 19:45:51 9 [RadioIf] Packet TX: 993ms
DEBUG | 19:45:51 9 [RadioIf] 0 packets remain in the TX queue
DEBUG | 19:45:52 10 [RadioIf] Completed sending (id=0xd903bc2e fr=0xe543e4cf to=0xe7159523, WantAck=0, HopLim=3 Ch=0xea encrypted len=102 hopStart=3 priority=10)
ERROR | 19:45:53 11 [SerialConsole] LittleFS corruption detected: StreamAPI
INFO  | 19:45:53 11 [SerialConsole] Rebooting to format LittleFS
��@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 0 LittleFS format complete; restoring default settings
DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 0 Use analog input 4 for battery level
INFO  | ??:??:?? 0 Scan for i2c devices
DEBUG | ??:??:?? 0 Scan for I2C devices on port 1
INFO  | ??:??:?? 0 PCF8563 found at address 0x51
INFO  | ??:??:?? 0 BME280 found at address 0x77
INFO  | ??:??:?? 0 2 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:7,2.5.20.b8592f886
DEBUG | ??:??:?? 0 Reset reason: 0x0
DEBUG | ??:??:?? 1 Set random seed 1998879461
INFO  | ??:??:?? 1 Init NodeDB
ERROR | ??:??:?? 1 Could not open / read /prefs/db.proto
WARN  | ??:??:?? 1 Devicestate 0 is old, discard
INFO  | ??:??:?? 1 Install default DeviceState
DEBUG | ??:??:?? 1 Initial packet id 1956988456
DEBUG | ??:??:?? 1 Partially randomized packet id 1389333033
DEBUG | ??:??:?? 1 Use nodenum 0xe543e4cf 
ERROR | ??:??:?? 1 Could not open / read /prefs/config.proto
INFO  | ??:??:?? 1 Install default LocalConfig
INFO  | ??:??:?? 1 Set default channel and radio preferences!
INFO  | ??:??:?? 1 Wanted region 3, using EU_868
ERROR | ??:??:?? 1 Could not open / read /prefs/module.proto
INFO  | ??:??:?? 1 Install default ModuleConfig
ERROR | ??:??:?? 1 Could not open / read /prefs/channels.proto
INFO  | ??:??:?? 1 Install default ChannelFile
DEBUG | ??:??:?? 1 cleanupMeshDB purged 0 entries
DEBUG | ??:??:?? 1 Use nodenum 0xe543e4cf 
INFO  | ??:??:?? 1 Adding node to database with 1 nodes and 164104 bytes free!
INFO  | ??:??:?? 1 Generate new PKI keys
DEBUG | ??:??:?? 1 Generate Curve25519 keypair
INFO  | ??:??:?? 1 Set default channel and radio preferences!
INFO  | ??:??:?? 1 Wanted region 3, using EU_868
DEBUG | ??:??:?? 1 Opening /prefs/config.proto, fullAtomic=1
INFO  | ??:??:?? 1 Save /prefs/config.proto
DEBUG | ??:??:?? 1 Opening /prefs/channels.proto, fullAtomic=1
INFO  | ??:??:?? 1 Save /prefs/channels.proto
DEBUG | ??:??:?? 1 Opening /prefs/db.proto, fullAtomic=0
INFO  | ??:??:?? 2 Save /prefs/db.proto
DEBUG | ??:??:?? 2 Use GPIO42 for button
DEBUG | ??:??:?? 2 Use compiled/slipstreamed "CET-1CEST,M3.5.0/2,M10.5.0/3"
DEBUG | ??:??:?? 2 Set Timezone to "CET-1CEST,M3.5.0/2,M10.5.0/3"
DEBUG | ??:??:?? 2 Read RTC time from PCF8563 getDateTime as 2025-01-22 18:45:57 (1737571557)
INFO  | 19:45:57 2 GPS power state move from OFF to ACTIVE
DEBUG | 19:45:57 2 NeighborInfoModule is disabled
DEBUG | 19:45:57 2 SX126xInterface(cs=24, irq=20, rst=25, busy=17)
DEBUG | 19:45:57 2 SX126X_DIO3_TCXO_VOLTAGE defined, using DIO3 as TCXO reference voltage at 1.800000 V
INFO  | 19:45:57 2 Start meshradio init
INFO  | 19:45:57 2 Radio freq=869.525, config.lora.frequency_offset=0.000
INFO  | 19:45:57 2 Set radio: region=EU_868, name="Private", config=0, ch=0, power=27
INFO  | 19:45:57 2 myRegion->freqStart -> myRegion->freqEnd: 869.400024 -> 869.650024 (0.250000 MHz)
INFO  | 19:45:57 2 numChannels: 1 x 250.000kHz
INFO  | 19:45:57 2 channel_num: 1
INFO  | 19:45:57 2 frequency: 869.525024
INFO  | 19:45:57 2 Slot time: 77 msec
INFO  | 19:45:57 2 Set radio: final power level=22
INFO  | 19:45:57 2 SX126x init result 0
INFO  | 19:45:57 2 Frequency set to 869.525024
INFO  | 19:45:57 2 Bandwidth set to 250.000000
INFO  | 19:45:57 2 Power output set to 22
DEBUG | 19:45:57 2 Current limit set to 140.000000
DEBUG | 19:45:57 2 Current limit set result 0
DEBUG | 19:45:57 2 Set DIO2 as RF switch, result: 0
DEBUG | 19:45:57 2 SX126X_RXEN not defined, defaulting to RADIOLIB_NC
DEBUG | 19:45:57 2 SX126X_TXEN not defined, defaulting to RADIOLIB_NC
DEBUG | 19:45:57 2 Use MCU pin -1 as RXEN and pin -1 as TXEN to control RF switching
INFO  | 19:45:57 2 Set RX gain to boosted mode; result: 0
INFO  | 19:45:57 2 SX1262 init success
DEBUG | 19:45:57 2 LoRA bitrate = 118.394310 bytes / sec
INFO  | 19:45:57 2 PowerFSM init, USB power=1
DEBUG | 19:45:57 2 State: BOOT
ERROR | 19:45:57 2 NOTE! Record critical error 13 at src/platform/nrf52/main-nrf52.cpp:146
DEBUG | 19:45:57 2 [Power] Battery: usbPower=1, isCharging=1, batMv=4763, batPct=100
DEBUG | 19:45:57 2 [GPS] Probe for GPS at 9600

@Mictronics
Copy link
Contributor

And another real test of the new fix: My mobile T-Echo failed again. :-( Again during a time when a lot of new nodes were coming in.

Device frozen in lockFS(), just after reset:


//\ E S H T /\ S T / C

DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0  prefs (directory)
DEBUG | ??:??:?? 0    config.proto (270 Bytes)
DEBUG | ??:??:?? 0    channels.proto (277 Bytes)
DEBUG | ??:??:?? 0    db.proto (9504 Bytes)
DEBUG | ??:??:?? 0    module.proto (120 Bytes)
DEBUG | ??:??:?? 0  adafruit (directory)
DEBUG | ??:??:?? 0  bond_prph (directory)
DEBUG | ??:??:?? 0  bond_cntr (directory)
DEBUG | ??:??:?? 0 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 0 Use analog input 4 for battery level
INFO  | ??:??:?? 0 Scan for i2c devices
DEBUG | ??:??:?? 0 Scan for I2C devices on port 1
INFO  | ??:??:?? 0 PCF8563 found at address 0x51
INFO  | ??:??:?? 0 BME280 found at address 0x77
INFO  | ??:??:?? 0 2 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:7,2.5.20.f21cae3cb
DEBUG | ??:??:?? 0 Reset reason: 0x0
DEBUG | ??:??:?? 0 Set random seed 3700966679
INFO  | ??:??:?? 0 Init NodeDB
INFO  | ??:??:?? 0 Load /prefs/db.proto
ERROR | ??:??:?? 0 LFS assert: head >= 2 && head <= lfs->cfg->block_count

Installed right away the build with latest HEAD up to 8e8b22e.

��@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0  prefs (directory)
DEBUG | ??:??:?? 0    config.proto (270 Bytes)
DEBUG | ??:??:?? 0    channels.proto (277 Bytes)
DEBUG | ??:??:?? 0    db.proto (9504 Bytes)
DEBUG | ??:??:?? 0    module.proto (120 Bytes)
DEBUG | ??:??:?? 0  adafruit (directory)
DEBUG | ??:??:?? 0  bond_prph (directory)
DEBUG | ??:??:?? 0  bond_cntr (directory)
DEBUG | ??:??:?? 0 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 0 Use analog input 4 for battery level
INFO  | ??:??:?? 0 Scan for i2c devices
DEBUG | ??:??:?? 0 Scan for I2C devices on port 1
INFO  | ??:??:?? 0 PCF8563 found at address 0x51
INFO  | ??:??:?? 0 BME280 found at address 0x77
INFO  | ??:??:?? 0 2 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:7,2.5.20.b8592f886
DEBUG | ??:??:?? 0 Reset reason: 0x0
DEBUG | ??:??:?? 0 Set random seed 2077297047
INFO  | ??:??:?? 0 Init NodeDB
INFO  | ??:??:?? 0 Load /prefs/db.proto
ERROR | ??:??:?? 0 LittleFS corruption detected: head >= 2 && head <= lfs->cfg->block_count
INFO  | ??:??:?? 0 Rebooting to format LittleFS
��@INFO  | ??:??:?? 0 

//\ E S H T /\ S T / C

INFO  | ??:??:?? 0 LittleFS format complete; restoring default settings
DEBUG | ??:??:?? 0 Filesystem files:
DEBUG | ??:??:?? 0 Power::lipoInit lipo sensor is not ready yet
DEBUG | ??:??:?? 0 Use analog input 4 for battery level
INFO  | ??:??:?? 0 Scan for i2c devices
DEBUG | ??:??:?? 0 Scan for I2C devices on port 1
INFO  | ??:??:?? 0 PCF8563 found at address 0x51
INFO  | ??:??:?? 0 BME280 found at address 0x77
INFO  | ??:??:?? 0 2 I2C devices found
DEBUG | ??:??:?? 0 acc_info = 0
INFO  | ??:??:?? 0 S:B:7,2.5.20.b8592f886
DEBUG | ??:??:?? 0 Reset reason: 0x0
DEBUG | ??:??:?? 1 Set random seed 3583629176
INFO  | ??:??:?? 1 Init NodeDB
ERROR | ??:??:?? 1 Could not open / read /prefs/db.proto
WARN  | ??:??:?? 1 Devicestate 0 is old, discard
INFO  | ??:??:?? 1 Install default DeviceState
DEBUG | ??:??:?? 1 Initial packet id 1514645686
DEBUG | ??:??:?? 1 Partially randomized packet id 3290954935
DEBUG | ??:??:?? 1 Use nodenum 0xebe30cb7 
ERROR | ??:??:?? 1 Could not open / read /prefs/config.proto
INFO  | ??:??:?? 1 Install default LocalConfig
INFO  | ??:??:?? 1 Set default channel and radio preferences!
INFO  | ??:??:?? 1 Wanted region 3, using EU_868
ERROR | ??:??:?? 1 Could not open / read /prefs/module.proto
INFO  | ??:??:?? 1 Install default ModuleConfig
ERROR | ??:??:?? 1 Could not open / read /prefs/channels.proto
INFO  | ??:??:?? 1 Install default ChannelFile
DEBUG | ??:??:?? 1 cleanupMeshDB purged 0 entries
DEBUG | ??:??:?? 1 Use nodenum 0xebe30cb7 
INFO  | ??:??:?? 1 Adding node to database with 1 nodes and 164104 bytes free!
INFO  | ??:??:?? 1 Generate new PKI keys
DEBUG | ??:??:?? 1 Generate Curve25519 keypair
INFO  | ??:??:?? 1 Set default channel and radio preferences!
INFO  | ??:??:?? 1 Wanted region 3, using EU_868
DEBUG | ??:??:?? 1 Opening /prefs/config.proto, fullAtomic=1
INFO  | ??:??:?? 1 Save /prefs/config.proto
DEBUG | ??:??:?? 1 Opening /prefs/channels.proto, fullAtomic=1
INFO  | ??:??:?? 1 Save /prefs/channels.proto
DEBUG | ??:??:?? 1 Opening /prefs/db.proto, fullAtomic=0
INFO  | ??:??:?? 2 Save /prefs/db.proto
DEBUG | ??:??:?? 2 Use GPIO42 for button
DEBUG | ??:??:?? 2 Use compiled/slipstreamed "CET-1CEST,M3.5.0/2,M10.5.0/3"
DEBUG | ??:??:?? 2 Set Timezone to "CET-1CEST,M3.5.0/2,M10.5.0/3"
DEBUG | ??:??:?? 2 Read RTC time from PCF8563 getDateTime as 2025-01-22 21:12:41 (1737580361)
INFO  | 22:12:41 2 GPS power state move from OFF to ACTIVE
DEBUG | 22:12:41 2 NeighborInfoModule is disabled
DEBUG | 22:12:41 2 SX126xInterface(cs=24, irq=20, rst=25, busy=17)
DEBUG | 22:12:41 2 SX126X_DIO3_TCXO_VOLTAGE defined, using DIO3 as TCXO reference voltage at 1.800000 V
INFO  | 22:12:41 2 Start meshradio init
INFO  | 22:12:41 2 Radio freq=869.525, config.lora.frequency_offset=0.000
INFO  | 22:12:41 2 Set radio: region=EU_868, name="Private", config=0, ch=0, power=27
INFO  | 22:12:41 2 myRegion->freqStart -> myRegion->freqEnd: 869.400024 -> 869.650024 (0.250000 MHz)
INFO  | 22:12:41 2 numChannels: 1 x 250.000kHz
INFO  | 22:12:41 2 channel_num: 1
INFO  | 22:12:41 2 frequency: 869.525024
INFO  | 22:12:41 2 Slot time: 77 msec
INFO  | 22:12:41 2 Set radio: final power level=22
INFO  | 22:12:41 2 SX126x init result 0
INFO  | 22:12:41 2 Frequency set to 869.525024
INFO  | 22:12:41 2 Bandwidth set to 250.000000
INFO  | 22:12:41 2 Power output set to 22
DEBUG | 22:12:41 2 Current limit set to 140.000000
DEBUG | 22:12:41 2 Current limit set result 0
DEBUG | 22:12:41 2 Set DIO2 as RF switch, result: 0
DEBUG | 22:12:41 2 SX126X_RXEN not defined, defaulting to RADIOLIB_NC
DEBUG | 22:12:41 2 SX126X_TXEN not defined, defaulting to RADIOLIB_NC
DEBUG | 22:12:41 2 Use MCU pin -1 as RXEN and pin -1 as TXEN to control RF switching
INFO  | 22:12:41 2 Set RX gain to boosted mode; result: 0
INFO  | 22:12:41 2 SX1262 init success
DEBUG | 22:12:41 2 LoRA bitrate = 118.394310 bytes / sec
INFO  | 22:12:41 2 PowerFSM init, USB power=1
DEBUG | 22:12:41 2 State: BOOT
ERROR | 22:12:41 2 NOTE! Record critical error 13 at src/platform/nrf52/main-nrf52.cpp:146
DEBUG | 22:12:41 2 [Power] Battery: usbPower=1, isCharging=1, batMv=4551, batPct=100
DEBUG | 22:12:41 2 [GPS] Probe for GPS at 9600

So while the actual fix is obviously solving the lock on boot issue the root cause of the NodeDB corruption issue still seem to exist.

@esev
Copy link
Contributor

esev commented Jan 22, 2025

That's a good test. Thanks for that!

meshtastic/Adafruit_nRF52_Arduino#1 goes a long way toward addressing the corruption issue.

@Mictronics
Copy link
Contributor

Oh wow, caused be BT? I guess I can setup a long term log an play around here. I have two T-Echo on the bench I use for handheld purpose.
And yeah, the affected T-Echo is almost all time connected via BT to the phone when at home.

@esev
Copy link
Contributor

esev commented Jan 22, 2025

Oh wow, caused be BT?

Seems to be. It's a really odd/unexpected interaction between the two.

I guess I can setup a long term log an play around here.

Thank you so much. This would be really helpful.

I've had mine crash a couple times still - but I can't rule out my own testing interfering with something. So far no issues in the last ~20 hours on my latest run.

I also added a memcmp in meshtastic/Adafruit_nRF52_Arduino#1 in the loops where the retries are handled. If the memory doesn't read back correctly it also retries. I'm not 100% sure that is needed though - as I haven't hit any corruption since running a build with that extra logic in it.

@Mictronics
Copy link
Contributor

But there is a nuance here. I had four different T-Echo failing this week. Out of this four, two were setup in the wild, a third one in the backyard. These three, I am 100% sure, had and still have BT disabled. Only the one handheld, that failed again today, has BT enabled.
So BT can't be the only cause for the flash corruption issue. I experience this issue only with a lot of node cycling through DB.

@esev
Copy link
Contributor

esev commented Jan 22, 2025

This is a good data point. What firmware version were you running on them all?

@esev
Copy link
Contributor

esev commented Jan 23, 2025

I just got another crash as well. And I hadn't been home, nor connected to the node, for 8+ hours when it crashed. None of the memcmp verification checks I added were ever triggered either.

This still wasn't pre-SafeFile changes though. That's my next test and will start running that code tonight.

@Mictronics
Copy link
Contributor

This is a good data point. What firmware version were you running on them all?

Out in the wild something like 2.5.15 not sure. In the backyard 2.5.18 and the handheld one always the latest so 2.5.19 to 2.5.20. All custom build.

@esev
Copy link
Contributor

esev commented Jan 23, 2025

Here are the logs covering my previous 3 crashes. [Edit: These logs were with my build including 85de193 & 1c0f43c] I'm now running a build without 85de193 & 1c0f43c to see if it avoids the issue.

DEBUG | 02:36:30 35671 [Router] Opening /prefs/db.proto, fullAtomic=0
INFO  | 02:36:30 35671 [Router] Save /prefs/db.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:19: Assert lfs_cache_read 68842597
ERROR | 02:36:31 35672 [Router] LittleFS corruption detected: block < lfs->cfg->block_count
DEBUG | 16:06:30 47125 [Router] Opening /prefs/db.proto, fullAtomic=0
INFO  | 16:06:30 47125 [Router] Save /prefs/db.proto
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
[SOC   ] NRF_EVT_FLASH_OPERATION_SUCCESS#
lfs debug:19: Assert lfs_cache_read 2411999021
ERROR | 16:06:30 47126 [Router] LittleFS corruption detected: block < lfs->cfg->block_count
DEBUG | 23:29:22 81090 [Router] Opening /prefs/db.proto, fullAtomic=0
INFO  | 23:29:22 81090 [Router] Save /prefs/db.proto
lfs debug:19: Assert lfs_cache_read 1212109790
ERROR | 23:29:22 81090 [Router] LittleFS corruption detected: block < lfs->cfg->block_count

@Mictronics
Copy link
Contributor

Mictronics commented Jan 23, 2025

My log is running. Let's see what we get. But my guess is a similar or equal error message.

@esev
Copy link
Contributor

esev commented Jan 23, 2025

@esev @todd-herbert I wonder if we should revisit the seek(0) and truncate approach and switch to more destructive delete, then open for write? We may have introduced a variable here which is muddying the water on the problem.

This is worth investigating: littlefs-project/littlefs#800 / littlefs-project/littlefs#268

@todd-herbert
Copy link
Contributor

todd-herbert commented Jan 23, 2025

Here are the logs covering my previous 3 crashes. I'm now running a build without 85de193 & 1c0f43c

Just to clarify, are those logs from a build with or without 85de193 & 1c0f43c?

@esev
Copy link
Contributor

esev commented Jan 23, 2025

Those logs were all with 85de193 & 1c0f43c.

@todd-herbert
Copy link
Contributor

Those logs were all with 85de193 & 1c0f43c.

Ah right! Fingers crossed that reverting them does help. I wouldn't be surprised, and at least it would point at a reasonable workaround for the time being.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests