
AIO version of EventHubConsumerClient has significant memory leak #36398

Open
smoke opened this issue Jul 9, 2024 · 11 comments
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Event Hubs
  • Messaging: Messaging crew
  • needs-team-attention: This issue needs attention from Azure service team or SDK team.
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.

Comments

@smoke

smoke commented Jul 9, 2024

  • Package Name: azure-eventhub
  • Package Version: ^5.11.2 (maybe older as well)
  • Operating System: Linux
  • Python Version: 3.9, 3.12

Describe the bug
The AIO version of EventHubConsumerClient has a significant memory leak when async code is used inside on_event.

To Reproduce
Steps to reproduce the behavior:

  1. Use the reference implementation from https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/eventhub/azure-eventhub/samples/async_samples/recv_async.py (sketched below)
  2. Consume roughly 1.3k-5k events per minute for 10-15 minutes
  3. Observe that memory usage only grows
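
For reference, the linked sample boils down to roughly the following; the connection string and event hub name are placeholders:

import asyncio
from azure.eventhub.aio import EventHubConsumerClient

CONNECTION_STR = "<EVENT_HUB_CONNECTION_STRING>"  # placeholder
EVENTHUB_NAME = "<EVENT_HUB_NAME>"  # placeholder

async def on_event(partition_context, event):
    # Process the event, then checkpoint so the offset is persisted.
    print("Received event from partition: {}.".format(partition_context.partition_id))
    await partition_context.update_checkpoint(event)

async def main():
    client = EventHubConsumerClient.from_connection_string(
        conn_str=CONNECTION_STR,
        consumer_group="$Default",
        eventhub_name=EVENTHUB_NAME,
    )
    async with client:
        # "-1" starts reading from the beginning of each partition.
        await client.receive(on_event=on_event, starting_position="-1")

if __name__ == "__main__":
    asyncio.run(main())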

Expected behavior
Long-running event processing should not leak memory.

Screenshots
AIO version memory usage over 12h:
(note: the sharp drops are pod restarts caused by the OOM killer)
[image]

Sync version memory usage over 12h:
[image]

Additional context
For example, the sync implementation has no issues at all: https://github.dev/Azure/azure-sdk-for-python/tree/main/sdk/eventhub/azure-eventhub/samples/sync_samples/recv.py

@github-actions bot added the Client, customer-reported, Event Hubs, needs-team-attention, and question labels on Jul 9, 2024

github-actions bot commented Jul 9, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@swathipil added the Messaging label on Jul 9, 2024
@swathipil
Member

Hi @smoke - Thanks for opening an issue! Are you using websocket, i.e. passing in TransportType.AmqpOverWebsocket? Or are you using the sample as is without any modifications?

@swathipil added the needs-author-feedback label and removed the needs-team-attention label on Jul 9, 2024

github-actions bot commented Jul 9, 2024

Hi @smoke. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

@smoke
Author

smoke commented Jul 10, 2024

The sample as is, with a few lines added to send the events through httpx to another service.
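
For illustration only, the added code is roughly along these lines; the downstream URL and payload handling here are placeholders, not the actual service:

import httpx

# Hypothetical downstream endpoint; the real service is not part of this issue.
DOWNSTREAM_URL = "http://example.internal/ingest"
http_client = httpx.AsyncClient()

async def on_event(partition_context, event):
    # Forward the event body over HTTP, then checkpoint the partition.
    await http_client.post(DOWNSTREAM_URL, content=event.body_as_str())
    await partition_context.update_checkpoint(event)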

@github-actions bot added the needs-team-attention label and removed the needs-author-feedback label on Jul 10, 2024
@swathipil
Member

@smoke Would you also be able to provide debug level logging to see if the memory jumps correlate with underlying errors?
To turn on logging:

import logging
import sys
logger = logging.getLogger("azure.eventhub")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler)

Pass in logging_enable=True during client construction: EventHubConsumerClient(.....logging_enable=True)
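
Put together, a minimal sketch of that setup; the connection string and event hub name are placeholders:

import logging
import sys

from azure.eventhub.aio import EventHubConsumerClient

# Route DEBUG output from the azure.eventhub logger to stdout.
logger = logging.getLogger("azure.eventhub")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(stream=sys.stdout))

client = EventHubConsumerClient.from_connection_string(
    conn_str="<EVENT_HUB_CONNECTION_STRING>",  # placeholder
    consumer_group="$Default",
    eventhub_name="<EVENT_HUB_NAME>",  # placeholder
    logging_enable=True,  # also emit network-level trace logs through the logger above
)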

@swathipil added the needs-author-feedback label and removed the needs-team-attention label on Jul 10, 2024

@bretttegart

This memory leak is also impacting my application (forwarding large JSON audit events from Microsoft Entra to Azure Event Hub).

I reproduced the memory leak issue in a scratch environment. To do so, I wrote 200MB of dummy JSON data to a new event hub and downloaded it using the async reference implementation that Smoke linked in the issue. Debug logs from the receiver are attached.
debug.log
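
For reference, the dummy data was generated with something roughly like the following; the payload size, loop bounds, and connection details are assumptions, not the exact script used:

import asyncio
import json

from azure.eventhub import EventData
from azure.eventhub.aio import EventHubProducerClient

async def main():
    producer = EventHubProducerClient.from_connection_string(
        conn_str="<EVENT_HUB_CONNECTION_STRING>",  # placeholder
        eventhub_name="<EVENT_HUB_NAME>",  # placeholder
    )
    payload = json.dumps({"audit": "x" * 10_000})  # ~10 KB of dummy JSON per event
    sent_bytes = 0
    async with producer:
        while sent_bytes < 200 * 1024 * 1024:  # stop after roughly 200 MB
            batch = await producer.create_batch()
            try:
                while True:
                    batch.add(EventData(payload))
            except ValueError:
                # The batch is full (about 1 MB); send it and start another.
                pass
            await producer.send_batch(batch)
            sent_bytes += batch.size_in_bytes

if __name__ == "__main__":
    asyncio.run(main())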

Unfortunately, the Sync implementation doesn't work in my environment due to a different issue with our dual-stack TCP outgoing proxy.

Please let me know if I can provide any additional context or logging!

@smoke
Author

smoke commented Jul 15, 2024

@smoke Would you also be able to provide debug level logging to see if the memory jumps correlate with underlying errors? To turn on logging:

import logging
import sys
logger = logging.getLogger("azure.eventhub")
logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler(stream=sys.stdout)
logger.addHandler(handler)

Pass in logging_enable=True during client construction: EventHubConsumerClient(.....logging_enable=True)

Please find the debug log file attached: debug.log.tar.gz

Here is a graph of the memory usage that clearly shows the memory leak:
[image]

Note that, to make the scenario realistic, I used the following on_event handler:

async def on_event(partition_context, event):
    # Put your code here.
    # If the operation is i/o intensive, async will have better performance.
    print("Received event from partition: {}.".format(partition_context.partition_id))

    ### changed code starts ###
    print('Simulating work')  # noqa: T201
    await asyncio.sleep(30000)
    ### changed code finishes ###

    await partition_context.update_checkpoint(event)

Update: I have tested with the unmodified sample code and there is no memory leak; apparently the problem only appears when there is some async code inside on_event.

Here is the log from the unmodified code:
debug2.log.tar.gz

and the graph:
[image]

@github-actions bot added the needs-team-attention label and removed the needs-author-feedback label on Jul 15, 2024
@l0lawrence
Member

Hi @smoke,

A few questions:

Update: I have tested with the unmodified sample code and there is no memory leak; apparently the problem only appears when there is some async code inside on_event.

  1. When you say "some async code" here, are you referring to asyncio.sleep() specifically, or have you run into this with other async code as well?
  2. Have you tried sleep intervals other than 30,000 with different results? More info: after the first message from a consumer goes into on_event, you will continue to receive messages on that consumer and they will queue up in the background. When on_event wakes up and completes for the current message, it takes the next one and sleeps again.
  3. What size are the events you are consuming?
  4. Are you using Azure Functions or any other environment configurations we should consider?

@smoke
Author

smoke commented Jul 16, 2024

Hi @smoke,

A few questions:

Update: I have tested with the unmodified sample code and there is no memory leak; apparently the problem only appears when there is some async code inside on_event.

  1. When you say "some async code" here, are you referring to asyncio.sleep() specifically, or have you run into this with other async code as well?
  2. Have you tried sleep intervals other than 30,000 with different results? More info: after the first message from a consumer goes into on_event, you will continue to receive messages on that consumer and they will queue up in the background. When on_event wakes up and completes for the current message, it takes the next one and sleeps again.
  3. What size are the events you are consuming?
  4. Are you using Azure Functions or any other environment configurations we should consider?

My answers:

  1. Yes, the memory leak happens with other regular async code as well, e.g. processing and sending the events using httpx.AsyncClient().
  2. The regular async code I have completes in anywhere between 100 ms and 10-20 s, and the result is the same. I suspect on_event is executed in parallel by the consumer, which builds up back pressure of messages that have been read but not yet processed. In my opinion, parallel execution should be left to users, so they can handle the back pressure properly themselves (see the sketch after this list).
  3. Not sure; it is difficult to check exactly, but events should be anywhere between 100 KB and 1-2 MB. If it matters I can check further, let me know.
  4. K8s containers with amd64 architecture and regular EC2 instances in an AWS EKS cluster; nothing out of the ordinary.
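
As an illustration of what I mean by user-controlled back pressure, a minimal sketch (a bounded queue with a worker task is just one possible pattern; it bounds the work queued by my code and does not change whatever the SDK buffers internally):

import asyncio

# A bounded queue: on_event blocks once MAX_PENDING events are waiting,
# instead of letting unprocessed events pile up without limit.
MAX_PENDING = 100
pending = asyncio.Queue(maxsize=MAX_PENDING)

async def on_event(partition_context, event):
    await pending.put((partition_context, event))

async def worker():
    while True:
        partition_context, event = await pending.get()
        # Placeholder for the real (slow) processing, e.g. an httpx call.
        await asyncio.sleep(0.1)
        await partition_context.update_checkpoint(event)
        pending.task_done()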

@l0lawrence
Member

Hi @smoke, we were able to reproduce your issue and are looking into a fix. We will update when we have more information.
