Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Getting coreMQTT: Collision when starting demo #123

Open
gamename opened this issue Dec 17, 2024 · 12 comments
Open

[BUG] Getting coreMQTT: Collision when starting demo #123

gamename opened this issue Dec 17, 2024 · 12 comments

Comments

@gamename
Copy link

Describe the bug
I was able to create/download the AWS certs, build the firmware, and successfully connect. However, I keep getting this error in the log:

E (4203397) coreMQTT: Collision when adding PacketID=3 at index=0.
I (4203407) sub_pub_unsub_demo: coreMQTT-Agent connected.
I (4203407) temp_sub_pub_and_led_control_demo: coreMQTT-Agent connected.
I (4203417) ota_over_mqtt_demo: coreMQTT-Agent connected. Resuming OTA agent.
I (4203427) core_mqtt_agent_manager: coreMQTT-Agent connected.
E (4203477) network_transport: Connection closed
E (4203477) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTRecvFailed
I (4203477) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (4203487) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (4203497) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (4203507) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (4203517) core_mqtt_agent_manager: TLS connection was disconnected.
I (4204657) core_mqtt_agent_manager: TLS connection established.
I (4205027) coreMQTT: MQTT connection established with the broker.
I (4205027) core_mqtt_agent_manager: Session present: 1

E (4205037) coreMQTT: Collision when adding PacketID=3 at index=0.
I (4205047) sub_pub_unsub_demo: coreMQTT-Agent connected.
I (4205047) temp_sub_pub_and_led_control_demo: coreMQTT-Agent connected.
I (4205057) ota_over_mqtt_demo: coreMQTT-Agent connected. Resuming OTA agent.
I (4205057) core_mqtt_agent_manager: coreMQTT-Agent connected.
W (4205067) sub_pub_unsub_demo: Error or timed out waiting for ack to subscribe message 1. Re-attempting subscribe.
I (4205077) sub_pub_unsub_demo: Task "SubPub0" sending subscribe request to coreMQTT-Agent for topic filter: /filter/SubPub0 with id 1
E (4205107) network_transport: Connection closed
E (4205107) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTRecvFailed
I (4205107) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (4205127) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (4205137) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (4205147) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (4205157) core_mqtt_agent_manager: TLS connection was disconnected.
I (4206107) core_mqtt_agent_manager: TLS connection established.
I (4206357) coreMQTT: MQTT connection established with the broker.
I (4206357) core_mqtt_agent_manager: Session present: 1

System information

  • Hardware board: esp32-c3
  • IDE used: vscode
  • Operating System: MacOS
  • Code version: my own fork from the current version
  • Project/Demo: mqtt_demo_mutual_auth

Expected behavior
I don't expect to see MQTT message collisions in the log

Screenshots or console output
See above

Additional context
I had some trouble getting the proper policies for the MQTT topics. I ended up using this one:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iot:Connect",
      "Resource": "arn:aws:iot:us-east-1:657592885771:client/iot-reference-esp32-test"
    },
    {
      "Effect": "Allow",
      "Action": [
        "iot:Publish",
        "iot:Receive",
        "iot:Subscribe"
      ],
      "Resource": [
        "arn:aws:iot:us-east-1:657592885771:topic/filter/*",
        "arn:aws:iot:us-east-1:657592885771:topicfilter/filter/*"
      ]
    }
  ]
}
@jasonpcarroll
Copy link
Member

Hmm. It's possible you may need to tweak some network settings. Collision is most likely caused by either the device not acking a packet from AWS before AWS re-sends it (i.e. AWS sends a packet, device receives packet but sends its ack back too late so AWS resends the packet the device had already received, causing collision) or AWS not sending requested packets back to the device in time before the device retries (i.e. device requests some sort of response, AWS sends the response but device does not receive the response before it retries, device receives the initial response that was sent, and later gets the second response it requested on retry, leading to collision). I will look into how to adjust this project's settings so you can try some things.

Best,

Jason Carroll

@gamename
Copy link
Author

Hmm. It's possible you may need to tweak some network settings. Collision is most likely caused by either the device not acking a packet from AWS before AWS re-sends it (i.e. AWS sends a packet, device receives packet but sends its ack back too late so AWS resends the packet the device had already received, causing collision) or AWS not sending requested packets back to the device in time before the device retries (i.e. device requests some sort of response, AWS sends the response but device does not receive the response before it retries, device receives the initial response that was sent, and later gets the second response it requested on retry, leading to collision). I will look into how to adjust this project's settings so you can try some things.

Best,

Jason Carroll

Thanks!

@WilliamFrasson
Copy link

WilliamFrasson commented Dec 26, 2024

Hi all
I got the same kind of error. It's the first time I see it, while using the same wifi connection for almost a year. I see the coreMQTT collision error during OTA download; I did the same test with the same code a few months ago and didn't get any of these errors. Was there any update in the AWS job manager?

here my log:

2024-12-25 19:17:16.497] I (232682) MAIN: Send TOPIC: FLTE20/EWB/out/memcheck MESSAGE: {"time": 232366, "portime": 232, "HeapFree": 61856, "MinHeapFree": 5408, "fwVer": "1.8.0", "rssi": -64}
[2024-12-25 19:17:16.566] I (232752) OTA_OVER_MQTT: Sent PUBLISH packet to broker $aws/things/FLTE20/jobs/start-next to broker.
[2024-12-25 19:17:16.566]
[2024-12-25 19:17:16.567]
[2024-12-25 19:17:18.112] E (234302) coreMQTT: Handling of keep alive failed. Status=MQTTKeepAliveTimeout
[2024-12-25 19:17:18.113] E (234302) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTKeepAliveTimeout

[2024-12-25 19:17:18.113] I (234302) OTA_OVER_MQTT: coreMQTT-Agent disconnected. Suspending OTA agent.
[2024-12-25 19:17:18.113] I (234302) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
[2024-12-25 19:17:18.113] I (234302) MQTT_UTILS: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
[2024-12-25 19:17:18.133] I (234322) core_mqtt_agent_manager: TLS connection was disconnected.
[2024-12-25 19:17:18.492] I (234682) MAIN: Publish MemCheck SUCCESS
[2024-12-25 19:17:18.492] I (234682) MAIN: Check for connection
[2024-12-25 19:17:21.543] E (237732) esp-tls: [sock=56] select() timeout
[2024-12-25 19:17:21.544] E (237732) esp-tls: Failed to open new connection

[2024-12-25 19:17:21.852] I (238042) core_mqtt_agent_manager: Retry attempt 1.
[2024-12-25 19:17:24.961] I (241142) core_mqtt_agent_manager: TLS connection established.
[2024-12-25 19:17:25.412] I (241602) coreMQTT: MQTT connection established with the broker.
[2024-12-25 19:17:25.417] I (241602) core_mqtt_agent_manager: Session present: 1
[2024-12-25 19:17:25.417]
[2024-12-25 19:17:25.417] E (241602) coreMQTT: Collision when adding PacketID=27 at index=0.
[2024-12-25 19:17:25.421] I (241602) OTA_OVER_MQTT: coreMQTT-Agent connected. Resuming OTA agent.
[2024-12-25 19:17:25.424] I (241602) core_mqtt_agent_manager: coreMQTT-Agent connected.
[2024-12-25 19:17:25.424] I (241602) MQTT_UTILS: coreMQTT-Agent connected.
[2024-12-25 19:17:25.754] I (241942) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
[2024-12-25 19:17:25.756] I (241942) coreMQTT: State record updated. New state=MQTTPublishDone.
[2024-12-25 19:17:25.756] I (241942) MQTT_UTILS: Task "MQTT_APPL" QoS1 Publish 23 (P23:F0). SUCCESS
[2024-12-25 19:17:25.756] I (241942) MQTT_UTILS: Send Response into respQueue: PublishMessageId=23, retVal=1
[2024-12-25 19:17:25.853] I (242042) MQTT_UTILS: xMQTTPubishMessagesQueue: send MQTT Message (Qos=1), Topic: FLTE20/EWB/out/memcheck, Message: {"time": 232366, "portime": 232, "HeapFree": 61856, "MinHeapFree": 5408, "fwVer": "1.8.0", "rssi": -64}, timeout: 2000 msecs

@jasonpcarroll
Copy link
Member

jasonpcarroll commented Dec 31, 2024

Hi @WilliamFrasson,

Hmm... well that gives another data point. I apologize for the delay - I am still looking into this, but am waiting on an ESP32-C3 to come in. So what @WilliamFrasson mentioned (in that it worked a few months ago) - I wonder if it has anything to do with this PR https://github.com/FreeRTOS/iot-reference-esp32/pull/104/files. Could you try altering the values of the timeout and see if that fixes anything?

Best,

Jason Carroll

@orlyprofili
Copy link

I was not getting this with esp32c6 devkit yesterday, but now see it when I switched to esp32s3 devkit. Power cycling the devkit, erase/reflash, closing/reopening connection on aws had little effect (see below).

I switched back to esp32c6 and confirm it's still ok. Then back to esp32s3, and now it's ok. So it may be intermittent, I will update if I find the failure mode.

I applied the same aws certs to the esp32s3 as I did the esp32c6, and running main branch as of today [f4c15d5].

I also notice that when it's failing as shown below the iteration count that appears in aws console seems to alternate between a low single digit count and a count offset by about 10430, example, 1, 14030, 2, 10431. This is not consistent, some sessions it seemed to count normally.

I can control the LED but it takes several pub attempts to get through to it when it's in this state..

I (144033) sub_pub_unsub_demo: coreMQTT-Agent connected.
I (144043) ota_over_mqtt_demo: coreMQTT-Agent connected. Resuming OTA agent.
I (144043) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (144053) temp_sub_pub_and_led_control_demo: coreMQTT-Agent connected.
E (145363) network_transport: Connection closed
E (145363) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTRecvFailed
I (145363) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (145373) core_mqtt_agent_manager: TLS connection was disconnected.
I (145373) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (145393) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (145393) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (146413) core_mqtt_agent_manager: TLS connection established.
I (146793) coreMQTT: MQTT connection established with the broker.
I (146793) core_mqtt_agent_manager: Session present: 1

I (146793) sub_pub_unsub_demo: coreMQTT-Agent connected.
I (146803) ota_over_mqtt_demo: coreMQTT-Agent connected. Resuming OTA agent.
I (146803) core_mqtt_agent_manager: coreMQTT-Agent connected.
I (146813) temp_sub_pub_and_led_control_demo: coreMQTT-Agent connected.
I (146823) temp_sub_pub_and_led_control_demo: Sending publish request to agent with message "{"temperatureSensor":{ "taskName": "TempSubPubLED", "temperatureValue": 0.000000, "iteration": 22}}" on topic "/filter/TempSubPubLED"
I (146843) temp_sub_pub_and_led_control_demo: Task TempSubPubLED waiting for publish 22 to complete.
I (146973) coreMQTT: Ack packet deserialized with result: MQTTSuccess.
I (146973) coreMQTT: State record updated. New state=MQTTPublishDone.
I (146983) temp_sub_pub_and_led_control_demo: Rx'ed ack for QoS1 publish from Tx to /filter/TempSubPubLED (P23:F0).
I (147013) coreMQTT: De-serialized incoming PUBLISH packet: DeserializerResult=MQTTSuccess.
I (147013) coreMQTT: State record updated. New state=MQTTPubAckSend.
I (147023) temp_sub_pub_and_led_control_demo: Received incoming publish message {"temperatureSensor":{ "taskName": "TempSubPubLED", "temperatureValue": 0.000000, "iteration": 22}}
E (148133) network_transport: Connection closed
E (148133) coreMQTT: Call to receiveSingleIteration failed. Status=MQTTRecvFailed
I (148133) sub_pub_unsub_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (148143) ota_over_mqtt_demo: coreMQTT-Agent disconnected. Suspending OTA agent.
I (148153) core_mqtt_agent_manager: coreMQTT-Agent disconnected.
I (148153) temp_sub_pub_and_led_control_demo: coreMQTT-Agent disconnected. Preventing coreMQTT-Agent commands from being enqueued.
I (148163) core_mqtt_agent_manager: TLS connection was disconnected.
I (149203) core_mqtt_agent_manager: TLS connection established.
I (149383) coreMQTT: MQTT connection established with the broker.
I (149383) core_mqtt_agent_manager: Session present: 1

@jasonpcarroll
Copy link
Member

@orlyprofili
When you say iteration count - what exactly are you referring to?

@orlyprofili
Copy link

@orlyprofili
When you say iteration count - what exactly are you referring to?

In aws console webview, viewing the subscribed temperature message as they came in, there is an iteration counter as part of each temperature message. Normally it increments, but when it was acting up, it patterned something like what I described above.

@jasonpcarroll
Copy link
Member

@orlyprofili I still have not been able to reproduce this error... I will take a deeper look at the code next week... seems to be some bug there with the iteration count ... I wonder if commands are going out of scope to the coreMQTT-Agent (when you make a command to coreMQTT-Agent one must ensure that the memory passed to the command stays in scope - otherwise bad things ensue as memory is corrupted.

@orlyprofili
Copy link

@orlyprofili I still have not been able to reproduce this error... I will take a deeper look at the code next week... seems to be some bug there with the iteration count ... I wonder if commands are going out of scope to the coreMQTT-Agent (when you make a command to coreMQTT-Agent one must ensure that the memory passed to the command stays in scope - otherwise bad things ensue as memory is corrupted.

Thanks. I haven't observed it since either. Will share if it comes up again.

@jasonpcarroll
Copy link
Member

jasonpcarroll commented Mar 27, 2025

@orlyprofili,

Have not had time to look closer into it yet. I am going to try to next week

Best.

@jasonpcarroll
Copy link
Member

@orlyprofili, from inspection I did not see any places where coreMQTT-Agent command contexts went out of scope. I also have not been able to replicate your issue.... definitely seems like it shows up more on the S3. Hmm.. I think I will check task stack settings next... wonder if something is possibly bleeding over from one task to another.

@orlyprofili
Copy link

@jasonpcarroll thanks for looking, I've shifted to some other tasks in the past week or so. If I observe this again when I come back to this stuff, I'll post here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants