Skip to content

maheshmahadevan/practical-guide-to-celery-in-production

Repository files navigation

A Practical Guide to Celery in Production

This was my talk at PyCon India 2024 at Bengaluru, Nimhans Convention Center on 21st Sept 2024.

Link to the Proposal - https://in.pycon.org//cfp/2024/proposals/practical-guide-to-celery-in-production~aADMO/

YouTube - https://www.youtube.com/watch?v=S931rE_BKZs

Slides - Drive Github

Transcript Summary - transcript_summary.md

Transcript - transcript.txt

Blog - Coming soon...


Overview

This guide shares real-world lessons from running Celery at scale in a production SaaS environment.
It covers architecture, scaling, deployment, and monitoring strategies that enable a reliable, cost-efficient asynchronous task system on AWS.


1. Background

To process large volumes of audio, video, and analytics tasks, we rely heavily on Celery with AWS SQS as the message broker.


2. Architecture

High-Level Flow

  1. Django applications enqueue background tasks into SQS.
  2. EC2 Celery worker nodes poll SQS for messages.
  3. Workers execute tasks and acknowledge completion.

Worker Design

  • Workers run on AWS Spot Instances managed by Auto Scaling Groups (ASGs) for cost savings.

  • Queues are isolated by workload type:

    Queue Workload Instance Type Concurrency
    video CPU/memory intensive High-spec 1
    email, report Lightweight tasks Medium/low >1
  • Each queue maps to its own ASG through AWS tags.


3. Scaling Strategies

Strategy Description Benefit
Time-based Scale up during US hours, down during off-hours. Cost efficiency
Queue-depth Adjust instance count based on SQS message count. Reactive scaling
Predictive AWS predictive scaling uses ML to anticipate load. Proactive scaling

Handling Spot Terminations

  • AWS sends a 2-minute termination notice.
  • A cron-based shutdown hook stops Celery from accepting new tasks and finishes ongoing work.
  • EventBridge → Lambda → SSM triggers graceful stop commands during scale-in.

4. Production Scale

Metric Value
Celery tasks 200 +
Queues 70
Auto Scaling Groups 70
Peak queue depth 20 000 + messages
Active workers 100 – 500
Daily throughput over 1 M tasks

5. Deployment Workflow

Early versions used sequential Fabric SSH deployments (~2 hours).
We switched to a hybrid of Fabric + AWS CodeDeploy:

  1. Fabric builds and uploads versioned artifacts to S3.
  2. CodeDeploy rolls them out in parallel across all ASGs.
  3. Deployments complete in 5–7 minutes.

Deployed versions are verified through Grafana dashboards that query instance metadata via Athena.


6. Monitoring and Observability

Traditional tools like Celery Flower or Prometheus Exporter lacked native SQS support, so we built a custom stack:

Metrics Tracked

  • Queue message count
  • Age of oldest message
  • ASG activity (scale in/out)
  • Task latency (P50, P99, max)

Alerting

  • Oldest message age > threshold (e.g., 180 s for critical tasks)
  • Queue depth > defined limit

Stack: CloudWatch → Athena → Grafana


7. Task Configuration Tips

# celeryconfig.py
task_acks_late = True
worker_prefetch_multiplier = 1
  • Prevents workers from fetching more tasks than they can handle.
  • Ensures messages aren’t acknowledged until successfully processed.

SQS Visibility Timeout

Set visibility timeout to P99 task duration + 2–3 minutes buffer
to avoid duplicate processing on long-running tasks.


8. Lessons Learned

Area Key Takeaway
Scaling Predictive scaling reduces cold starts and cost.
Deployment Parallel CodeDeploy cut release time from 2 h → 7 min.
Spot handling Graceful shutdowns via EventBridge + Lambda + SSM prevent data loss.
Monitoring Custom Grafana dashboards provide better SQS visibility.

9. Ongoing Challenges

  • Graceful recovery for unfinished tasks after abrupt spot termination.
  • Efficient detection and retry of unacknowledged SQS messages.
  • Memory and performance profiling at large scale.
  • Cleaner observability for Celery + SQS (few native tools exist).
  • Seamless scaling with containerized workers.

10. Conclusion

Celery remains a powerful and flexible asynchronous framework, but running it at production scale requires thoughtful engineering.
By combining SQS, Auto Scaling, predictive capacity, and custom monitoring, we’ve achieved high reliability with significant cost efficiency.


2025 Mahesh Mahadevan — A Practical Guide to Celery in Production*

About

This was my Python talk during PyCon India 2024 in Bengaluru.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published