This was my talk at PyCon India 2024, held at the NIMHANS Convention Centre, Bengaluru, on 21st September 2024.
Link to the Proposal - https://in.pycon.org//cfp/2024/proposals/practical-guide-to-celery-in-production~aADMO/
YouTube - https://www.youtube.com/watch?v=S931rE_BKZs
Transcript Summary - transcript_summary.md
Transcript - transcript.txt
Blog - Coming soon...
This guide shares real-world lessons from running Celery at scale in a production SaaS environment.
It covers architecture, scaling, deployment, and monitoring strategies that enable a reliable, cost-efficient asynchronous task system on AWS.
To process large volumes of audio, video, and analytics tasks, we rely heavily on Celery with AWS SQS as the message broker.
- Django applications enqueue background tasks into SQS.
- EC2 Celery worker nodes poll SQS for messages.
- Workers execute tasks and acknowledge completion.
- Workers run on AWS Spot Instances managed by Auto Scaling Groups (ASGs) for cost savings.
- Queues are isolated by workload type:
| Queue | Workload | Instance Type | Concurrency |
|---|---|---|---|
| video | CPU/memory intensive | High-spec | 1 |
| email, report | Lightweight tasks | Medium/low | >1 |
Each queue maps to its own ASG through AWS tags.
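As a rough illustration, this per-queue routing can be expressed in Celery configuration like the sketch below; the module name, queue prefix, region, and task names are assumptions rather than the exact production values.

```python
# celery_app.py: minimal sketch of a Celery app using SQS as the broker and
# routing tasks to workload-specific queues (all names are illustrative).
from celery import Celery

app = Celery("worker")
app.conf.update(
    broker_url="sqs://",  # credentials are picked up from the environment / IAM role
    broker_transport_options={
        "region": "us-east-1",          # assumed region
        "queue_name_prefix": "prod-",   # assumed prefix used to namespace queues
    },
    task_routes={
        "tasks.process_video": {"queue": "video"},    # heavy: high-spec nodes, concurrency 1
        "tasks.send_email": {"queue": "email"},       # lightweight: concurrency > 1
        "tasks.generate_report": {"queue": "report"},
    },
)
```

Each worker fleet then consumes only its own queue, e.g. `celery -A celery_app worker -Q video --concurrency=1`.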
| Strategy | Description | Benefit |
|---|---|---|
| Time-based | Scale up during US hours, down during off-hours. | Cost efficiency |
| Queue-depth | Adjust instance count based on SQS message count. | Reactive scaling |
| Predictive | AWS predictive scaling uses ML to anticipate load. | Proactive scaling |
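The queue-depth strategy, for instance, can be as simple as a periodic job that converts SQS backlog into an ASG desired capacity. The sketch below is illustrative only; the queue URL, ASG name, and messages-per-worker ratio are assumptions.

```python
# scale_on_queue_depth.py: simplified sketch of queue-depth based scaling.
import boto3

sqs = boto3.client("sqs")
asg = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/prod-video"  # placeholder
ASG_NAME = "celery-video-asg"                                              # placeholder
MESSAGES_PER_WORKER = 50                                                   # assumed ratio


def scale():
    # How many messages are currently waiting in the queue?
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Translate the backlog into a desired instance count, clamped to the ASG limits.
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    desired = max(group["MinSize"], min(group["MaxSize"], -(-backlog // MESSAGES_PER_WORKER)))

    asg.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired, HonorCooldown=True
    )


if __name__ == "__main__":
    scale()
```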
- AWS sends a 2-minute termination notice.
- A cron-based shutdown hook stops Celery from accepting new tasks and finishes ongoing work.
- EventBridge → Lambda → SSM triggers graceful stop commands during scale-in.
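A minimal sketch of the cron-based hook, assuming the worker writes a pidfile and the instance can reach the EC2 instance metadata service (IMDSv1 shown for brevity; the pidfile path is an assumption):

```python
# spot_shutdown_hook.py: runs every few seconds from cron; when AWS announces
# the 2-minute spot interruption, send SIGTERM so Celery performs a warm
# shutdown (stop consuming new tasks, finish in-flight work).
import os
import signal

import requests

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
PIDFILE = "/var/run/celery/worker.pid"  # assumed pidfile location


def termination_notice_received() -> bool:
    # The endpoint returns 404 until AWS schedules the instance for interruption.
    try:
        return requests.get(SPOT_ACTION_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    if termination_notice_received():
        with open(PIDFILE) as f:
            os.kill(int(f.read().strip()), signal.SIGTERM)
```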
| Metric | Value |
|---|---|
| Celery tasks | 200+ |
| Queues | 70 |
| Auto Scaling Groups | 70 |
| Peak queue depth | 20,000+ messages |
| Active workers | 100–500 |
| Daily throughput | Over 1 million tasks |
Early versions used sequential Fabric SSH deployments (~2 hours).
We switched to a hybrid of Fabric + AWS CodeDeploy:
- Fabric builds and uploads versioned artifacts to S3.
- CodeDeploy rolls them out in parallel across all ASGs.
- Deployments complete in 5–7 minutes.
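A simplified sketch of that flow (the bucket, application, and deployment-group names are placeholders, and the artifact itself is produced by the Fabric build step):

```python
# deploy.py: sketch of the hybrid flow, push the versioned artifact to S3,
# then fan out through CodeDeploy, one deployment per group / ASG.
import boto3

s3 = boto3.client("s3")
codedeploy = boto3.client("codedeploy")

BUCKET = "my-release-artifacts"                               # placeholder
KEY = "releases/app-2024-09-21.zip"                           # placeholder versioned artifact
APPLICATION = "celery-workers"                                # placeholder CodeDeploy application
DEPLOYMENT_GROUPS = ["video-asg", "email-asg", "report-asg"]  # placeholders


def deploy():
    # 1. Upload the versioned build artifact (built by Fabric).
    s3.upload_file("dist/app.zip", BUCKET, KEY)

    # 2. Kick off CodeDeploy deployments in parallel, one per deployment group.
    for group in DEPLOYMENT_GROUPS:
        codedeploy.create_deployment(
            applicationName=APPLICATION,
            deploymentGroupName=group,
            revision={
                "revisionType": "S3",
                "s3Location": {"bucket": BUCKET, "key": KEY, "bundleType": "zip"},
            },
        )


if __name__ == "__main__":
    deploy()
```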
Deployed versions are verified through Grafana dashboards that query instance metadata via Athena.
Traditional tools like Celery Flower or Prometheus Exporter lacked native SQS support, so we built a custom stack:
Metrics Tracked
- Queue message count
- Age of oldest message
- ASG activity (scale in/out)
- Task latency (P50, P99, max)
Alerting
- Oldest message age > threshold (e.g., 180 s for critical tasks)
- Queue depth > defined limit
Stack: CloudWatch → Athena → Grafana
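For illustration, the two core queue metrics can be collected with a few boto3 calls; the sketch below assumes per-queue collection and leaves out the Athena/Grafana plumbing.

```python
# collect_queue_metrics.py: sketch of pulling queue depth straight from SQS
# and oldest-message age from CloudWatch (AWS/SQS namespace).
from datetime import datetime, timedelta, timezone

import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")


def queue_depth(queue_url: str) -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def oldest_message_age(queue_name: str) -> float:
    # SQS publishes ApproximateAgeOfOldestMessage to CloudWatch for each queue.
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=300,
        Statistics=["Maximum"],
    )
    points = stats["Datapoints"]
    return points[-1]["Maximum"] if points else 0.0
```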
```python
# celeryconfig.py
task_acks_late = True
worker_prefetch_multiplier = 1
```

- `worker_prefetch_multiplier = 1` prevents workers from fetching more tasks than they can handle.
- `task_acks_late = True` ensures messages aren’t acknowledged until they are successfully processed.

Set the SQS visibility timeout to the P99 task duration plus a 2–3 minute buffer to avoid duplicate processing of long-running tasks.
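For the SQS transport this is a single setting; the 900-second value below is an assumed example (roughly a 12-minute P99 plus buffer), not the production figure.

```python
# celeryconfig.py (continued): visibility timeout sized to P99 + buffer.
broker_transport_options = {
    "visibility_timeout": 900,  # seconds; assumed value for illustration
}
```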
| Area | Key Takeaway |
|---|---|
| Scaling | Predictive scaling reduces cold starts and cost. |
| Deployment | Parallel CodeDeploy cut release time from 2 h → 7 min. |
| Spot handling | Graceful shutdowns via EventBridge + Lambda + SSM prevent data loss. |
| Monitoring | Custom Grafana dashboards provide better SQS visibility. |
- Graceful recovery for unfinished tasks after abrupt spot termination.
- Efficient detection and retry of unacknowledged SQS messages.
- Memory and performance profiling at large scale.
- Cleaner observability for Celery + SQS (few native tools exist).
- Seamless scaling with containerized workers.
Celery remains a powerful and flexible asynchronous framework, but running it at production scale requires thoughtful engineering.
By combining SQS, Auto Scaling, predictive capacity, and custom monitoring, we’ve achieved high reliability with significant cost efficiency.
2025 Mahesh Mahadevan — *A Practical Guide to Celery in Production*