Issue Summary
OpenTextShield mBERT currently tops out at ~32-34 req/s on M4 Mac mini hardware due to a CPU-only bottleneck. Load testing showed GPT-OSS-20B is roughly 1000x slower, confirming mBERT's relative efficiency, but we need to break through the current throughput ceiling for production SMS filtering workloads.
Background
• Current Performance: 32.55 req/s peak throughput (50 concurrent requests)
• Bottleneck Cause: Single-threaded mBERT inference + CPU resource limits
• Hardware: M4 Mac mini (12 cores, 24GB RAM, Neural Engine available)
• Test Results: See benchmark/opentextshield_vs_gpt-oss_load_test_comparison_Scenario_1.md
Required Optimizations
🚀 High Priority (Immediate Impact)
[ ] Implement Batch Processing: Modify inference pipeline to handle multiple requests per model call instead of single-threaded processing
[ ] Enable MLX/ANE Acceleration: Leverage the Apple Silicon GPU and Neural Engine for hardware-accelerated inference (existing MLX support mentioned in codebase)
[ ] Horizontal Scaling: Deploy multiple Docker containers with load balancing using existing launch_multiple_containers.sh
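The batching item above can be sketched as an async micro-batcher placed in front of the model: requests queue up, and a background loop flushes them to a single batched inference call once the batch is full or a short window expires. This is an illustrative sketch, not existing OpenTextShield code; `classify_batch`, `MAX_BATCH`, and `MAX_WAIT_MS` are hypothetical names, and the stub stands in for one mBERT forward pass over a padded batch.

```python
import asyncio

MAX_BATCH = 32     # max texts per model call (assumed tuning knob)
MAX_WAIT_MS = 10   # flush window before running a partial batch

def classify_batch(texts):
    # Stand-in for a single batched mBERT forward pass.
    return ["ham" for _ in texts]

class MicroBatcher:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def classify(self, text):
        # Called by each request handler; resolves when the batch runs.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self):
        # Background task: collect requests, flush on size or timeout.
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
            while len(batch) < MAX_BATCH:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            labels = classify_batch([text for text, _ in batch])
            for (_, fut), label in zip(batch, labels):
                fut.set_result(label)
```

A FastAPI handler would `await batcher.classify(text)` while `batcher.run()` runs as a startup task; throughput improves because padding and attention compute are amortized across the batch.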
Medium Priority (Production Readiness)
[ ] Test Batching + MLX Combo: Benchmark the combined effect; a 5-10x throughput improvement is expected
[ ] Reverse Proxy Setup: Configure Nginx/Traefik for load balancing across container instances
[ ] Autoscaling & Circuit Breakers: Add production deployment features for reliability
[ ] Monitoring Infrastructure: Implement latency histograms, saturation metrics, and error budgets
Low Priority (Observability)
[ ] Tail Latency Alerting: Add P95/P99 latency monitoring and alerts
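P95/P99 alerting reduces to a percentile over a window of latency samples plus a threshold check. This is a minimal sketch using the nearest-rank method; `should_alert` and the 0.5 s P99 budget are hypothetical, chosen only to show the shape of the check.

```python
def percentile(samples, p):
    # Nearest-rank percentile: value at rank ceil(p/100 * n), 1-based.
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, min(len(ordered), -(-p * len(ordered) // 100)))
    return ordered[rank - 1]

def should_alert(samples, p99_budget_s=0.5):
    # Fire when tail latency exceeds the budget (threshold is illustrative).
    return percentile(samples, 99) > p99_budget_s
```

In production this would run over a sliding window (e.g. the last few minutes of samples) rather than the full history, so alerts recover once the tail improves.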
Expected Outcomes
• Throughput Target: 150-300+ req/s (5-10x improvement)
• Concurrent Capacity: Support 1000+ concurrent requests
• Production Readiness: Enterprise-grade SMS filtering capability
Files to Reference
• Load test results: benchmark/results_load_test_opentextshield.json
• Hardware specs: benchmark/hardware_specs.md
• Full report: benchmark/opentextshield_vs_gpt-oss_load_test_comparison_Scenario_1.md
• Existing scaling script: launch_multiple_containers.sh
Implementation Notes
• Start with batching + MLX acceleration for immediate gains
• Use horizontal scaling for linear throughput increases
• Test on M4 Mac mini before broader deployment
• Maintain backward compatibility with existing API