
@lexfrei lexfrei commented Dec 24, 2025

Summary

  • Call set_link_mode(rx_on_when_idle=true, device_type=false, network_data=false) after Thread stack initialization
  • Applied to both ThreadDriverTaskImpl and ThreadCoexDriverTaskImpl

Motivation

MTDs (Minimal Thread Devices) have rx_on_when_idle=false by default, so they miss incoming UDP packets while idle. This breaks Matter-over-Thread because:

  1. Device completes BLE commissioning (PASE handshake)
  2. Device joins Thread network and registers SRP services
  3. Matter controller tries to establish CASE session via Thread UDP
  4. Device misses CASE request because radio is off → timeout

Setting rx_on_when_idle=true keeps the radio receiver active, allowing the device to respond to CASE requests.
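To make the three flags concrete, here is a minimal sketch of the link mode this PR requests. The `LinkMode` struct and its methods are hypothetical illustrations, not the `openthread` crate API; only the three flag names come from the `set_link_mode` call above.

```rust
/// Hypothetical mirror of the three flags passed to `set_link_mode`.
/// This is an illustration only, not a type from the `openthread` crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LinkMode {
    /// Receiver stays on while the device is idle.
    pub rx_on_when_idle: bool,
    /// `true` for a Full Thread Device, `false` for an MTD.
    pub device_type: bool,
    /// `true` if the device requests full network data.
    pub network_data: bool,
}

impl LinkMode {
    /// The mode requested in this PR: an MTD that keeps its receiver on.
    pub const fn mtd_rx_on() -> Self {
        Self {
            rx_on_when_idle: true,
            device_type: false,
            network_data: false,
        }
    }

    /// A device is "sleepy" when its receiver is off while idle.
    pub const fn is_sleepy(&self) -> bool {
        !self.rx_on_when_idle
    }
}
```

With `LinkMode::mtd_rx_on()`, `is_sleepy()` returns `false`, which is the state the device needs to be in when the CASE request arrives.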

Dependencies

Requires set_link_mode() method in openthread crate: esp-rs/openthread#50

Test plan

  • Verified with ESP32-C6 smart-garland project
  • BLE commissioning completes successfully
  • Thread network join works
  • Device receives CASE session requests after this fix

Call set_link_mode(rx_on_when_idle=true) after Thread stack initialization
to ensure MTD devices can receive unsolicited UDP messages.

This is required for Matter-over-Thread devices to successfully complete
CASE session establishment after BLE commissioning, as the Matter controller
sends CASE requests via Thread UDP which would otherwise be missed by
sleeping MTD devices.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
@lexfrei lexfrei marked this pull request as ready for review December 24, 2025 21:10
Point to lexfrei/openthread#feat/add-set-link-mode which adds
the set_link_mode() method required for rx_on_when_idle support.

Upstream PR: esp-rs/openthread#50

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
lexfrei added a commit to lexfrei/smart-garland that referenced this pull request Dec 24, 2025
Replace local vendor paths with GitHub fork branches:
- rs-matter-embassy: lexfrei/rs-matter-embassy#feat/enable-rx-on-when-idle
- openthread: lexfrei/openthread#feat/add-set-link-mode

PRs pending upstream:
- esp-rs/openthread#50
- sysgrok/rs-matter-embassy#30

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
@ivmarkov
Collaborator

Looks reasonable!

Still, could you wait a few days for my return? I'm AFK right now and should be back Dec 31 / Jan 01.

I wonder how that would work for the nrf driver, where rx_when_idle seems not to be supported?

@lexfrei
Author

lexfrei commented Dec 25, 2025

Regarding the nrf driver question: this is my first experience with both ESP and Rust, so my understanding may be incomplete. Here's what I found:

set_link_mode operates at the OpenThread API level (MLE layer), not at the PHY/radio driver level. If the nrf configuration doesn't support rx_on_when_idle, otThreadSetLinkMode should return an error (likely OT_ERROR_INVALID_ARGS), and the current code will propagate it via map_err(to_matter_err)?.

Possible approaches:

  1. Keep as-is — fail if rx_on_when_idle is truly required for Matter CASE sessions
  2. Graceful fallback — log a warning but continue:
    if let Err(e) = ot.set_link_mode(true, false, false) {
        warn!("Failed to set rx_on_when_idle: {:?}", e);
    }
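The two approaches can be compared side by side with a small self-contained sketch. Everything here (`LinkModeControl`, `LinkError`, `Policy`, and the two mocks) is a hypothetical stand-in so the example compiles on its own; it is not the real `openthread` or `rs-matter-embassy` API.

```rust
/// Hypothetical error type standing in for the driver's error.
#[derive(Debug, PartialEq, Eq)]
pub enum LinkError {
    InvalidArgs,
}

/// Hypothetical stand-in for the Thread handle.
pub trait LinkModeControl {
    fn set_link_mode(
        &mut self,
        rx_on_when_idle: bool,
        device_type: bool,
        network_data: bool,
    ) -> Result<(), LinkError>;
}

/// Which of the two approaches to take on failure.
pub enum Policy {
    /// Approach 1: propagate the error to the caller.
    Strict,
    /// Approach 2: log a warning and continue.
    BestEffort,
}

/// Returns `Ok(true)` if the mode was applied, `Ok(false)` if it was
/// skipped under `BestEffort`, and `Err` if it failed under `Strict`.
pub fn enable_rx_on_when_idle<T: LinkModeControl>(
    ot: &mut T,
    policy: Policy,
) -> Result<bool, LinkError> {
    match ot.set_link_mode(true, false, false) {
        Ok(()) => Ok(true),
        Err(e) => match policy {
            Policy::Strict => Err(e),
            Policy::BestEffort => {
                eprintln!("Failed to set rx_on_when_idle: {e:?}");
                Ok(false)
            }
        },
    }
}

/// Mock driver that rejects the request.
pub struct Unsupported;

impl LinkModeControl for Unsupported {
    fn set_link_mode(&mut self, _: bool, _: bool, _: bool) -> Result<(), LinkError> {
        Err(LinkError::InvalidArgs)
    }
}

/// Mock driver that accepts the request and records the flag.
pub struct Supported {
    pub rx_on: bool,
}

impl LinkModeControl for Supported {
    fn set_link_mode(&mut self, rx: bool, _: bool, _: bool) -> Result<(), LinkError> {
        self.rx_on = rx;
        Ok(())
    }
}
```

Returning a `bool` under `BestEffort` lets the caller still know that the device may behave as sleepy, which matters if CASE establishment later times out.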

Also, it probably makes sense to get esp-rs/openthread#50 merged first, so this PR doesn't depend on my fork.

Happy to adjust based on your guidance when you're back.

During Matter commissioning, the SRP removal loop could block
indefinitely waiting for records to be removed from the SRP server.
This consumed the Fail-Safe timer (typically 120-180 seconds), causing
CommissioningComplete to fail with ConstraintError.

Add a 10-second timeout to the removal loop. If records aren't removed
within this time, proceed anyway with a warning. This prevents the
Fail-Safe timer from expiring during the mDNS registration phase.
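A bounded wait of this kind can be sketched as follows. This is a blocking `std` illustration with hypothetical names (`wait_with_deadline`, `WaitOutcome`); the actual change runs in async embedded code and would use the executor's timer rather than `std::thread::sleep`.

```rust
use std::time::{Duration, Instant};

/// Result of a bounded wait.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    Done,
    TimedOut,
}

/// Polls `is_done` every `poll` interval until it returns `true`
/// or `timeout` has elapsed, whichever comes first.
pub fn wait_with_deadline(
    mut is_done: impl FnMut() -> bool,
    timeout: Duration,
    poll: Duration,
) -> WaitOutcome {
    let deadline = Instant::now() + timeout;
    loop {
        if is_done() {
            return WaitOutcome::Done;
        }
        if Instant::now() >= deadline {
            // Caller decides what to do: here the commit logs a warning
            // and proceeds instead of blocking further.
            return WaitOutcome::TimedOut;
        }
        std::thread::sleep(poll);
    }
}
```

A caller would pass a closure that checks whether the SRP records are gone and a 10-second timeout, then log a warning on `WaitOutcome::TimedOut`.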

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
@lexfrei lexfrei marked this pull request as draft December 25, 2025 21:09
lexfrei and others added 6 commits December 26, 2025 03:20
The previous fix with timeout made things worse - after timeout the SRP
client was still in "removing" state and rejected new configuration
with OtError(13) INVALID_STATE.

This change:
- Skip SRP removal entirely if already empty (fresh commissioning)
- Use immediate removal (true) instead of graceful (false)
- Remove the blocking wait loop entirely

Immediate removal doesn't wait for SRP server acknowledgment, which
avoids consuming the Matter Fail-Safe timer during commissioning.
Stale records on the server will be cleared on TTL expiry.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
The set_link_mode call was happening before the device attached to the
Thread network, so the setting was being ignored/reset during attach.

Move the set_link_mode call to OtNetCtl::connect() right after
role.is_connected() becomes true. This ensures rx_on_when_idle is
properly enabled after the device has joined the network.

Without this fix, the device acts as a Sleepy End Device and cannot
receive unsolicited messages like CASE session establishment from
the commissioner, causing Apple Home commissioning to fail.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
Add periodic SRP status logging to help debug commissioning issues.
Logs SRP running state and server address every 5 seconds.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
Log number of registered vs total services to debug
commissioning issues.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
Reduce interval from 5s to 1s for better visibility before Thread detach.
Add tick counter and detailed state counts.

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
The immediate removal was being called on every loop iteration,
which removed services right after adding them. This caused services
to stay stuck in 'Adding' state forever.

Move the cleanup to run BEFORE the loop starts, so it only cleans up
stale records from previous device runs.
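The difference between cleaning up inside the loop and before it can be reproduced with a toy model. `MockSrpClient`, `ACK_TICKS`, and `run` are hypothetical; the model assumes, purely for illustration, that a service needs to survive two loop iterations before the server acknowledges it.

```rust
/// Assumed number of loop iterations a service must survive before the
/// (simulated) SRP server acknowledges it. Illustrative value only.
const ACK_TICKS: u32 = 2;

#[derive(Debug, PartialEq, Eq)]
pub enum ServiceState {
    Absent,
    Adding,
    Registered,
}

/// Toy SRP client tracking one service by its age in ticks.
#[derive(Default)]
pub struct MockSrpClient {
    age: Option<u32>,
}

impl MockSrpClient {
    /// Immediate removal: drop the record without waiting for the server.
    pub fn remove_immediately(&mut self) {
        self.age = None;
    }

    /// Add the service if it is not already present.
    pub fn ensure_added(&mut self) {
        if self.age.is_none() {
            self.age = Some(0);
        }
    }

    /// Advance simulated time by one loop iteration.
    pub fn tick(&mut self) {
        if let Some(age) = self.age.as_mut() {
            *age += 1;
        }
    }

    pub fn state(&self) -> ServiceState {
        match self.age {
            None => ServiceState::Absent,
            Some(a) if a >= ACK_TICKS => ServiceState::Registered,
            Some(_) => ServiceState::Adding,
        }
    }
}

/// Runs the registration loop and returns the final service state.
pub fn run(cleanup_every_iteration: bool, iterations: u32) -> ServiceState {
    let mut srp = MockSrpClient::default();
    srp.age = Some(5); // stale record left over from a previous run

    if !cleanup_every_iteration {
        srp.remove_immediately(); // fixed variant: clean up once, up front
    }
    for _ in 0..iterations {
        if cleanup_every_iteration {
            srp.remove_immediately(); // buggy variant: wipes the fresh record
        }
        srp.ensure_added();
        srp.tick();
    }
    srp.state()
}
```

In this model `run(true, 10)` ends in `ServiceState::Adding`, while `run(false, 10)` ends in `ServiceState::Registered`, matching the behavior described in the commit message.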

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Aleksei Sviridkin <[email protected]>
@ivmarkov
Collaborator

ivmarkov commented Jan 6, 2026

@lexfrei This also fails with fmt issues.

select(self.0.wait_changed(), Timer::after(Duration::from_secs(1))).await;
}

// Enable rx_on_when_idle AFTER Thread attach so device can receive
Collaborator

"Must be called after the device has joined the network, not before".

Why is that? Do we know the root cause?

ot.set_link_mode(true, false, false)
    .map_err(to_matter_err)?;

// SRP diagnostic task - logs status every 1 second for debugging
Collaborator

Having these diagnostics doesn't hurt, but could you move them to a separate async function, log_srp_state or similar?
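One way to keep such a function small is to separate building the log line from the async loop that emits it. The sketch below covers only the pure formatting half; `SrpSnapshot`, `ServiceState`, and `srp_status_line` are hypothetical names, and the log format is an assumption rather than what the PR prints.

```rust
/// Hypothetical per-service state, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceState {
    Adding,
    Registered,
    Removing,
}

/// Hypothetical snapshot of the SRP client taken once per tick.
pub struct SrpSnapshot<'a> {
    pub running: bool,
    pub server: Option<&'a str>,
    pub services: &'a [ServiceState],
}

/// Builds one diagnostic line: tick counter, running state, server
/// address, and registered vs. total service counts.
pub fn srp_status_line(tick: u32, snap: &SrpSnapshot<'_>) -> String {
    let registered = snap
        .services
        .iter()
        .filter(|s| **s == ServiceState::Registered)
        .count();
    format!(
        "SRP[{tick}]: running={}, server={}, services={}/{} registered",
        snap.running,
        snap.server.unwrap_or("none"),
        registered,
        snap.services.len()
    )
}
```

An async `log_srp_state` would then only need to take a snapshot, call this helper, and sleep for the logging interval, which keeps the formatting testable without a Thread stack.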


/// Run the `OtMdns` instance by listening to the mDNS services and registering them with the SRP server
pub async fn run(&self, matter: &Matter<'_>) -> Result<(), OtError> {
// On first iteration only: clean up any stale SRP records from previous runs.
Collaborator

I have concerns with this.
To my understanding, you can't just forget your own SRP record, because then you can no longer re-register yourself with the same SRP server under the same FQDN.

You really need to first remove your registration from the SRP server, and then re-register. Also, if I remember correctly, when you claim an FQDN you receive some sort of key from the SRP server, and both the key and the FQDN are saved (or should be saved) in the persistent storage of OpenThread, and hence in the persistent storage of rs-matter, so that they survive a device reboot.

But perhaps I'm missing something, if you could elaborate?
