Issue Summary
OpenTextShield mBERT currently tops out at ~32-34 req/s on M4 Mac mini hardware due to a CPU-only bottleneck. Load testing showed GPT-OSS-20B is roughly 1000x slower, confirming mBERT's relative efficiency, but we need to break through the current throughput ceiling for production SMS filtering workloads.
Background
• Current Performance: 32.55 req/s peak throughput (50 concurrent requests)
• Bottleneck Cause: Single-threaded mBERT inference + CPU resource limits
• Hardware: M4 Mac mini (12 cores, 24GB RAM, Neural Engine available)
• Test Results: See benchmark/opentextshield_vs_gpt-oss_load_test_comparison_Scenario_1.md
Required Optimizations
🚀 High Priority (Immediate Impact)
[ ] Implement Batch Processing: Modify inference pipeline to handle multiple requests per model call instead of single-threaded processing
[ ] Enable MLX/ANE Acceleration: Leverage the Apple Silicon GPU and Neural Engine for hardware-accelerated inference (existing MLX support mentioned in codebase)
[ ] Horizontal Scaling: Deploy multiple Docker containers with load balancing using existing launch_multiple_containers.sh
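The batching item above can be sketched as an async micro-batcher placed in front of the model: requests queue up, and a background loop flushes them to a single batched inference call once the batch is full or a short window expires. This is an illustrative sketch, not existing OpenTextShield code; `classify_batch`, `MAX_BATCH`, and `MAX_WAIT_MS` are hypothetical names, and the stub stands in for one mBERT forward pass over a padded batch.

```python
import asyncio

MAX_BATCH = 32     # max texts per model call (assumed tuning knob)
MAX_WAIT_MS = 10   # flush window before running a partial batch

def classify_batch(texts):
    # Stand-in for a single batched mBERT forward pass.
    return ["ham" for _ in texts]

class MicroBatcher:
    def __init__(self):
        self.queue = asyncio.Queue()

    async def classify(self, text):
        # Called by each request handler; resolves when the batch runs.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self):
        # Background task: collect requests, flush on size or timeout.
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + MAX_WAIT_MS / 1000
            while len(batch) < MAX_BATCH:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            labels = classify_batch([text for text, _ in batch])
            for (_, fut), label in zip(batch, labels):
                fut.set_result(label)
```

A FastAPI handler would `await batcher.classify(text)` while `batcher.run()` runs as a startup task; throughput improves because padding and attention compute are amortized across the batch.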
Medium Priority (Production Readiness)
[ ] Test Batching + MLX Combo: Benchmark the combined effect; a 5-10x throughput improvement is expected
[ ] Reverse Proxy Setup: Configure Nginx/Traefik for load balancing across container instances
[ ] Autoscaling & Circuit Breakers: Add production deployment features for reliability
[ ] Monitoring Infrastructure: Implement latency histograms, saturation metrics, and error budgets
Low Priority (Observability)
[ ] Tail Latency Alerting: Add P95/P99 latency monitoring and alerts
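P95/P99 alerting reduces to a percentile over a window of latency samples plus a threshold check. This is a minimal sketch using the nearest-rank method; `should_alert` and the 0.5 s P99 budget are hypothetical, chosen only to show the shape of the check.

```python
def percentile(samples, p):
    # Nearest-rank percentile: value at rank ceil(p/100 * n), 1-based.
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, min(len(ordered), -(-p * len(ordered) // 100)))
    return ordered[rank - 1]

def should_alert(samples, p99_budget_s=0.5):
    # Fire when tail latency exceeds the budget (threshold is illustrative).
    return percentile(samples, 99) > p99_budget_s
```

In production this would run over a sliding window (e.g. the last few minutes of samples) rather than the full history, so alerts recover once the tail improves.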
Expected Outcomes
• Throughput Target: 150-300+ req/s (5-10x improvement)
• Concurrent Capacity: Support 1000+ concurrent requests
• Production Readiness: Enterprise-grade SMS filtering capability
Files to Reference
• Load test results: benchmark/results_load_test_opentextshield.json
• Hardware specs: benchmark/hardware_specs.md
• Full report: benchmark/opentextshield_vs_gpt-oss_load_test_comparison_Scenario_1.md
• Existing scaling script: launch_multiple_containers.sh
Implementation Notes
• Start with batching + MLX acceleration for immediate gains
• Use horizontal scaling for linear throughput increases
• Test on M4 Mac mini before broader deployment
• Maintain backward compatibility with existing API