
fix(mem): align TLSF to 8 bytes on ARMv7-M/E-M/v8-M to avoid LDRD fault #10015

Draft
KentLee86 wants to merge 1 commit into lvgl:master from KentLee86:fix/tlsf-align-armv7m-ldrd

Conversation

@KentLee86

Fixes #4747 (unaligned memory access, previously closed as not planned — see below).

Summary

On 32-bit ARM Cortex-M3 / M4 / M7 / M33 (ARMv7-M, ARMv7E-M, ARMv8-M) the LDRD / STRD instructions require strict 8-byte alignment and fault with a UsageFault (UFSR.UNALIGNED) regardless of CCR.UNALIGN_TRP. GCC emits these for 64-bit struct accesses assuming malloc returns pointers aligned to alignof(max_align_t) — which is 8 on these targets because double / long long are 8-byte aligned.

The built-in TLSF allocator currently uses ALIGN_SIZE_LOG2 = 2 (4 bytes) on all 32-bit builds. In internal SRAM, allocations usually happen to land on 8-byte boundaries and the issue stays hidden, but on external RAM pools (e.g. STM32H7 + FMC SDRAM configured via LV_MEM_ADR) misaligned addresses are hit almost immediately and the CPU hard-faults inside lv_tlsf_malloc (reached via lv_mem_core_builtin.c:147) during the first non-trivial allocation from the draw path.

Change

Detect LDRD-capable M-profile cores via __ARM_ARCH >= 7 (excluding ARMv6-M, which has no LDRD) and promote ALIGN_SIZE_LOG2 to 3 (8 bytes) there. TLSF_64BIT still takes precedence. No behavior change on AVR, RV32, ARMv6-M (Cortex-M0/M0+) or other 32-bit targets.

#if defined (TLSF_64BIT)
    ALIGN_SIZE_LOG2 = 3,
#elif (defined(__ARM_ARCH) && (__ARM_ARCH >= 7) && !defined(__ARM_ARCH_6M__))
    /* ARMv7-M / ARMv7E-M / ARMv8-M need 8-byte alignment for LDRD/STRD. */
    ALIGN_SIZE_LOG2 = 3,
#else
    ALIGN_SIZE_LOG2 = 2,
#endif

How it was diagnosed

Reproducer: [env:lvgl_test] on an STM32H743II board with 32 MB FMC SDRAM, LV_MEM_ADR = 0xC0100000 (a 4 MB pool in the SDRAM region left free after the 800×480 RGB565 LTDC framebuffer). lv_init() succeeds, but lv_timer_handler() hard-faults on the first lv_draw_rect.

GDB attach after fault (J-Link):

CFSR  = 0x01000000
  UFSR = 0x0100   → bit 8 = UNALIGNED
HFSR  = 0x40000000 (FORCED)

Backtrace:
  WWDG_IRQHandler (Default_Handler Infinite_Loop)
  <signal handler called>
  lv_draw_rect             .../lv_draw_rect.c:256
  lv_obj_draw              .../lv_obj.c:733
  ...
  lv_malloc_core           .../lv_mem_core_builtin.c:147
  lv_tlsf_malloc (tlsf=0xC0100000)  .../lv_tlsf.c:1102

Ruled out by direct tests before landing on TLSF:

  • SDRAM / FMC / MPU: mixed-struct probe {u32, u16, u8, void*, u64} at 0xC0100000 reads/writes fine. A 25 MB pattern R/W stress alongside LVGL runs 401 iterations with 0 errors.
  • CCR.UNALIGN_TRP: already 0, clearing it explicitly has no effect (LDRD strict alignment is ISA-level, independent of that bit).
  • MPU: SDRAM region set to Normal / non-cacheable / bufferable — no change.
  • LVGL version: reproduced on both 9.2.2 and 9.3.0.

Workaround until the fix is merged: build LVGL with -mno-unaligned-access so GCC stops emitting LDRD/STRD. This fixes the symptom but not the root cause (TLSF's 4-byte alignment is below alignof(max_align_t) on these targets).

Verification

Same board, same env, only the diff in this PR applied (no -mno-unaligned-access):

  • lv_init() ✓, lv_timer_handler() ✓
  • 8-row dashboard renders continuously, loop iter count grows, no fault
  • 4 MB SDRAM heap accepted, internal AXI SRAM usage drops from 90 % to 15 %

Notes

  • Code style: single #elif block matching the existing #if defined (TLSF_64BIT) style, no reformatting.
  • No new options in lv_conf_template.h; lv_conf_internal_gen.py / Kconfig are not affected.
  • Doc update not needed — allocator ABI is unchanged.
  • Tests: none added; the existing TLSF tests still pass on x86_64 (via TLSF_64BIT), and adding an ARMv7-M HW test would require CI changes. Happy to add one if that's the project's preference.

Related reports

  • unaligned memory access (#4747): same symptom, closed as not planned
  • LVGL forum "Getting hardfault when using Ext. RAM for LV_MEM_ADR" (12766)
  • STMicroelectronics community "stm32h7b0 has UNALIGNED hardfault problem when using LVGL 9.1" (695235)

Marked as Draft for maintainer feedback on the detection macro and whether an additional conditional (e.g. also __ARM_FEATURE_LDRD) is preferred.

On 32-bit ARM Cortex-M3 / M4 / M7 / M33 (ARMv7-M, ARMv7E-M, ARMv8-M) the
LDRD / STRD instructions require strict 8-byte alignment and fault with a
UsageFault (UFSR.UNALIGNED) regardless of CCR.UNALIGN_TRP. GCC emits these
for 64-bit struct accesses assuming malloc returns pointers aligned to
alignof(max_align_t), which is 8 on these targets (double / long long).

The TLSF built-in allocator currently uses 4-byte alignment for 32-bit
builds, so block addresses can end up only 4-byte aligned. In internal
SRAM this usually slips by because allocations tend to land on 8-byte
boundaries anyway, but on external RAM pools (e.g. STM32H7 + FMC SDRAM
configured via LV_MEM_ADR) the mis-aligned addresses are hit quickly and
the CPU hard-faults inside lv_tlsf_malloc / lv_draw_rect on the first
non-trivial allocation.

Detect LDRD-capable ARM M-profile cores via __ARM_ARCH >= 7 (excluding
ARMv6-M, which has no LDRD) and promote ALIGN_SIZE_LOG2 to 3 (8 bytes)
there. No behavior change on AVR / RV32 / ARMv6-M / other 32-bit targets,
and TLSF_64BIT still takes precedence.

Related: lvgl#4747 (unaligned hardfault, closed as not planned), LVGL forum
"Getting hardfault when using Ext. RAM for LV_MEM_ADR" and multiple
STM32H7 + LVGL reports in the ST community.

Verified on STM32H743II + 32 MB SDRAM: LV_MEM_ADR in SDRAM (4 MB) now
runs an 8-row 800x480 dashboard cleanly; previously it HardFaulted on the
first lv_draw_rect.

Signed-off-by: wslee <dldntjr407@gmail.com>
@KentLee86
Author

On second thought, here's how a maintainer can verify the root cause without any external hardware or reproducer project — just by inspecting an existing ARMv7-M LVGL build artifact.

Verify GCC does emit LDRD on malloc-returned pointers

Take any existing LVGL build for Cortex-M3 / M4 / M7 (for example the STM32H7 demo in lv_port_riverdi_101-stm32h7 compiled with the default toolchain):

arm-none-eabi-objdump -d firmware.elf \
  | awk '/^[0-9a-f]+ <lv_draw_rect>/,/^$/' \
  | grep -E 'ldrd|strd' | head

You will see LDRD/STRD instructions inside lv_draw_rect (and many other LVGL functions) whose base register holds a pointer returned by lv_malloc. The compiler emits them because it assumes malloc honors alignof(max_align_t) == 8 on these targets.

Why TLSF's current 4-byte alignment is UB on these targets

Per ARMv7-M Architecture Reference Manual (§A3.2.1), LDRD / STRD with a base other than SP raise UsageFault on any address that is not 8-byte aligned, regardless of SCB->CCR.UNALIGN_TRP. alignof(max_align_t) on ARMv7-M GCC is 8 (driven by double / long long), so any malloc must return at least 8-byte alignment to satisfy C11 §7.22.3:

The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement…

TLSF with ALIGN_SIZE_LOG2 = 2 violates that on any target where alignof(max_align_t) > 4. The violation only surfaces when actual allocations land on 4-byte-but-not-8-byte boundaries, which is rare in internal SRAM (alloc order happens to be favorable) but common when the pool is placed in external RAM via LV_MEM_ADR.

So the fix in this PR isn't "add a workaround for STM32H7" — it's restoring the alignment contract that the C standard already requires of any general-purpose allocator on these targets. The same latent UB exists on M3 / M4 / M33 but just happens to rarely hit the wrong byte boundary.

Net effect of the change

  • Wastes at most 4 bytes per allocation on 32-bit ARMv7-M/E-M/v8-M targets (worst case).
  • No change to TLSF_64BIT builds, AVR, RV32, ARMv6-M (Cortex-M0/M0+).
  • Closes a class of "sporadic hardfault deep in LVGL when moving heap to external RAM" reports without requiring users to add -mno-unaligned-access or hand-pick an aligned address.

@github-actions
Contributor

Hi 👋, thank you for your PR!

We've run benchmarks in an emulated environment. Here are the results:

ARM Emulated 32b - lv_conf_perf32b

| Scene Name | Avg CPU (%) | Avg FPS | Avg Time (ms) | Render Time (ms) | Flush Time (ms) |
| --- | --- | --- | --- | --- | --- |
| Empty screen | 11 | 33 | 0 | 0 | 0 |
| Moving wallpaper | 2 | 33 | 1 | 1 | 0 |
| Single rectangle | 0 | 50 | 0 | 0 | 0 |
| Multiple rectangles | 0 | 33 (-1) | 0 | 0 | 0 |
| Multiple RGB images | 0 | 39 | 0 | 0 | 0 |
| Multiple ARGB images | 10 (-6) | 41 (+3) | 2 (-2) | 2 (-2) | 0 |
| Rotated ARGB images | 57 (-2) | 44 | 15 | 15 | 0 |
| Multiple labels | 4 (+1) | 35 (+2) | 0 | 0 | 0 |
| Screen sized text | 83 (+2) | 45 | 17 | 17 | 0 |
| Multiple arcs | 39 | 33 | 7 | 7 | 0 |
| Containers | 4 (+1) | 37 (-1) | 0 | 0 | 0 |
| Containers with overlay | 89 (-1) | 21 | 44 | 44 | 0 |
| Containers with opa | 14 | 37 | 1 | 1 | 0 |
| Containers with opa_layer | 19 (+1) | 34 | 5 | 5 | 0 |
| Containers with scrolling | 45 (+1) | 45 | 12 | 12 | 0 |
| Widgets demo | 72 (+1) | 39 (-1) | 16 (-1) | 16 (-1) | 0 |
| All scenes avg. | 28 | 37 | 7 | 7 | 0 |

ARM Emulated 64b - lv_conf_perf64b

| Scene Name | Avg CPU (%) | Avg FPS | Avg Time (ms) | Render Time (ms) | Flush Time (ms) |
| --- | --- | --- | --- | --- | --- |
| Empty screen | 11 | 33 | 0 | 0 | 0 |
| Moving wallpaper | 1 | 33 | 0 | 0 | 0 |
| Single rectangle | 0 | 50 | 0 | 0 | 0 |
| Multiple rectangles | 0 | 35 | 0 | 0 | 0 |
| Multiple RGB images | 0 | 39 | 0 | 0 | 0 |
| Multiple ARGB images | 11 | 42 | 0 | 0 | 0 |
| Rotated ARGB images | 29 | 33 | 9 | 9 | 0 |
| Multiple labels | 2 | 35 | 0 | 0 | 0 |
| Screen sized text | 85 | 46 | 18 | 18 | 0 |
| Multiple arcs | 33 | 33 | 6 | 6 | 0 |
| Containers | 4 | 37 (-1) | 0 | 0 | 0 |
| Containers with overlay | 98 (+1) | 22 | 42 (+1) | 42 (+1) | 0 |
| Containers with opa | 15 | 38 | 0 | 0 | 0 |
| Containers with opa_layer | 8 (+1) | 36 | 2 (+1) | 2 (+1) | 0 |
| Containers with scrolling | 49 (+1) | 49 | 12 | 12 | 0 |
| Widgets demo | 67 | 40 | 15 | 15 | 0 |
| All scenes avg. | 25 | 37 | 6 | 6 | 0 |

Disclaimer: These benchmarks were run in an emulated environment using QEMU with instruction counting mode.
The timing values represent relative performance metrics within this specific virtualized setup and should
not be interpreted as absolute real-world performance measurements. Values are deterministic and useful for
comparing different LVGL features and configurations, but may not correlate directly with performance on
physical hardware. The measurements are intended for comparative analysis only.


🤖 This comment was automatically generated by a bot.

Member

@kisvegabor kisvegabor left a comment


It can explain a lot of weird issues. Thank you for investigating it.

The changes are ok from my side. Let's wait for @AndreCostaaa's opinion too



Development

Successfully merging this pull request may close these issues.

unaligned memory access

3 participants