This was my talk at PyCon India 2024, held at the NIMHANS Convention Centre, Bengaluru, on 21st September 2024.
Link to the Proposal - https://in.pycon.org//cfp/2024/proposals/practical-guide-to-celery-in-production~aADMO/
YouTube - https://www.youtube.com/watch?v=S931rE_BKZs
Transcript Summary - transcript_summary.md
Transcript - transcript.txt
Blog - Coming soon...
This guide shares real-world lessons from running Celery at scale in a production SaaS environment.
It covers architecture, scaling, deployment, and monitoring strategies that enable a reliable, cost-efficient asynchronous task system on AWS.
To process large volumes of audio, video, and analytics tasks, we rely heavily on Celery with AWS SQS as the message broker.
- Django applications enqueue background tasks into SQS.
- EC2 Celery worker nodes poll SQS for messages.
- Workers execute tasks and acknowledge completion.
- Workers run on AWS Spot Instances managed by Auto Scaling Groups (ASGs) for cost savings.
- Queues are isolated by workload type:
| Queue | Workload | Instance Type | Concurrency |
|---|---|---|---|
| video | CPU/memory intensive | High-spec | 1 |
| email, report | Lightweight tasks | Medium/low | >1 |
Each queue maps to its own ASG through AWS tags.
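As a rough illustration, this per-queue routing can be expressed in Celery configuration like the sketch below; the module name, queue prefix, region, and task names are assumptions rather than the exact production values.

```python
# celery_app.py: minimal sketch of a Celery app using SQS as the broker and
# routing tasks to workload-specific queues (all names are illustrative).
from celery import Celery

app = Celery("worker")
app.conf.update(
    broker_url="sqs://",  # credentials are picked up from the environment / IAM role
    broker_transport_options={
        "region": "us-east-1",          # assumed region
        "queue_name_prefix": "prod-",   # assumed prefix used to namespace queues
    },
    task_routes={
        "tasks.process_video": {"queue": "video"},    # heavy: high-spec nodes, concurrency 1
        "tasks.send_email": {"queue": "email"},       # lightweight: concurrency > 1
        "tasks.generate_report": {"queue": "report"},
    },
)
```

Each worker fleet then consumes only its own queue, e.g. `celery -A celery_app worker -Q video --concurrency=1`.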
| Strategy | Description | Benefit |
|---|---|---|
| Time-based | Scale up during US hours, down during off-hours. | Cost efficiency |
| Queue-depth | Adjust instance count based on SQS message count. | Reactive scaling |
| Predictive | AWS predictive scaling uses ML to anticipate load. | Proactive scaling |
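The queue-depth strategy, for instance, can be as simple as a periodic job that converts SQS backlog into an ASG desired capacity. The sketch below is illustrative only; the queue URL, ASG name, and messages-per-worker ratio are assumptions.

```python
# scale_on_queue_depth.py: simplified sketch of queue-depth based scaling.
import boto3

sqs = boto3.client("sqs")
asg = boto3.client("autoscaling")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/prod-video"  # placeholder
ASG_NAME = "celery-video-asg"                                              # placeholder
MESSAGES_PER_WORKER = 50                                                   # assumed ratio


def scale():
    # How many messages are currently waiting in the queue?
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Translate the backlog into a desired instance count, clamped to the ASG limits.
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    desired = max(group["MinSize"], min(group["MaxSize"], -(-backlog // MESSAGES_PER_WORKER)))

    asg.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired, HonorCooldown=True
    )


if __name__ == "__main__":
    scale()
```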
- AWS sends a 2-minute termination notice.
- A cron-based shutdown hook stops Celery from accepting new tasks and finishes ongoing work.
- EventBridge → Lambda → SSM triggers graceful stop commands during scale-in.
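A minimal sketch of the cron-based hook, assuming the worker writes a pidfile and the instance can reach the EC2 instance metadata service (IMDSv1 shown for brevity; the pidfile path is an assumption):

```python
# spot_shutdown_hook.py: runs every few seconds from cron; when AWS announces
# the 2-minute spot interruption, send SIGTERM so Celery performs a warm
# shutdown (stop consuming new tasks, finish in-flight work).
import os
import signal

import requests

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"
PIDFILE = "/var/run/celery/worker.pid"  # assumed pidfile location


def termination_notice_received() -> bool:
    # The endpoint returns 404 until AWS schedules the instance for interruption.
    try:
        return requests.get(SPOT_ACTION_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False


if __name__ == "__main__":
    if termination_notice_received():
        with open(PIDFILE) as f:
            os.kill(int(f.read().strip()), signal.SIGTERM)
```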
| Metric | Value |
|---|---|
| Celery tasks | 200+ |
| Queues | 70 |
| Auto Scaling Groups | 70 |
| Peak queue depth | 20,000+ messages |
| Active workers | 100–500 |
| Daily throughput | Over 1 million tasks |
Early versions used sequential Fabric SSH deployments (~2 hours).
We switched to a hybrid of Fabric + AWS CodeDeploy:
- Fabric builds and uploads versioned artifacts to S3.
- CodeDeploy rolls them out in parallel across all ASGs.
- Deployments complete in 5–7 minutes.
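A simplified sketch of that flow (the bucket, application, and deployment-group names are placeholders, and the artifact itself is produced by the Fabric build step):

```python
# deploy.py: sketch of the hybrid flow, push the versioned artifact to S3,
# then fan out through CodeDeploy, one deployment per group / ASG.
import boto3

s3 = boto3.client("s3")
codedeploy = boto3.client("codedeploy")

BUCKET = "my-release-artifacts"                               # placeholder
KEY = "releases/app-2024-09-21.zip"                           # placeholder versioned artifact
APPLICATION = "celery-workers"                                # placeholder CodeDeploy application
DEPLOYMENT_GROUPS = ["video-asg", "email-asg", "report-asg"]  # placeholders


def deploy():
    # 1. Upload the versioned build artifact (built by Fabric).
    s3.upload_file("dist/app.zip", BUCKET, KEY)

    # 2. Kick off CodeDeploy deployments in parallel, one per deployment group.
    for group in DEPLOYMENT_GROUPS:
        codedeploy.create_deployment(
            applicationName=APPLICATION,
            deploymentGroupName=group,
            revision={
                "revisionType": "S3",
                "s3Location": {"bucket": BUCKET, "key": KEY, "bundleType": "zip"},
            },
        )


if __name__ == "__main__":
    deploy()
```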
Deployed versions are verified through Grafana dashboards that query instance metadata via Athena.
Traditional tools like Celery Flower or Prometheus Exporter lacked native SQS support, so we built a custom stack:
Metrics Tracked
- Queue message count
- Age of oldest message
- ASG activity (scale in/out)
- Task latency (P50, P99, max)
Alerting
- Oldest message age > threshold (e.g., 180 s for critical tasks)
- Queue depth > defined limit
Stack: CloudWatch → Athena → Grafana
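For illustration, the two core queue metrics can be collected with a few boto3 calls; the sketch below assumes per-queue collection and leaves out the Athena/Grafana plumbing.

```python
# collect_queue_metrics.py: sketch of pulling queue depth straight from SQS
# and oldest-message age from CloudWatch (AWS/SQS namespace).
from datetime import datetime, timedelta, timezone

import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")


def queue_depth(queue_url: str) -> int:
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["ApproximateNumberOfMessages"]
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def oldest_message_age(queue_name: str) -> float:
    # SQS publishes ApproximateAgeOfOldestMessage to CloudWatch for each queue.
    now = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=300,
        Statistics=["Maximum"],
    )
    points = stats["Datapoints"]
    return points[-1]["Maximum"] if points else 0.0
```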
```python
# celeryconfig.py
task_acks_late = True
worker_prefetch_multiplier = 1
```

- `worker_prefetch_multiplier = 1` prevents workers from fetching more tasks than they can handle.
- `task_acks_late = True` ensures messages aren’t acknowledged until they are successfully processed.

Set the SQS visibility timeout to the P99 task duration plus a 2–3 minute buffer to avoid duplicate processing of long-running tasks.
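For the SQS transport this is a single setting; the 900-second value below is an assumed example (roughly a 12-minute P99 plus buffer), not the production figure.

```python
# celeryconfig.py (continued): visibility timeout sized to P99 + buffer.
broker_transport_options = {
    "visibility_timeout": 900,  # seconds; assumed value for illustration
}
```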
| Area | Key Takeaway |
|---|---|
| Scaling | Predictive scaling reduces cold starts and cost. |
| Deployment | Parallel CodeDeploy cut release time from 2 h → 7 min. |
| Spot handling | Graceful shutdowns via EventBridge + Lambda + SSM prevent data loss. |
| Monitoring | Custom Grafana dashboards provide better SQS visibility. |
- Graceful recovery for unfinished tasks after abrupt spot termination.
- Efficient detection and retry of unacknowledged SQS messages.
- Memory and performance profiling at large scale.
- Cleaner observability for Celery + SQS (few native tools exist).
- Seamless scaling with containerized workers.
Celery remains a powerful and flexible asynchronous framework, but running it at production scale requires thoughtful engineering.
By combining SQS, Auto Scaling, predictive capacity, and custom monitoring, we’ve achieved high reliability with significant cost efficiency.
2025 Mahesh Mahadevan — *A Practical Guide to Celery in Production*