Move metrics rollup percentile computation to PostgreSQL #1927
Optimize Hourly Rollups Using PostgreSQL-Native Percentile Computation
Performance & Stability Improvements
This PR addresses issue #1810 by improving the performance, memory efficiency, and concurrency safety of the hourly metrics rollup pipeline.
When running on PostgreSQL, hourly rollups now use database-native percentile functions (percentile_cont) to compute p50 / p95 / p99. This removes unnecessary Python-side computation, reduces CPU overhead, and consolidates rollups into a single grouped aggregation query. For non-Postgres databases, the existing Python-based percentile logic is preserved to maintain compatibility.
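As a rough sketch of what the single grouped aggregation could look like, assuming SQLAlchemy-style query construction (the metric_events table and its column names are placeholders, not the project's actual schema):

```python
from sqlalchemy import func, select

def build_hourly_percentile_query(metric_events):
    """One grouped aggregation computing p50/p95/p99 in the database."""
    return select(
        metric_events.c.metric_name,
        metric_events.c.bucket_start,
        func.percentile_cont(0.5).within_group(metric_events.c.value).label("p50"),
        func.percentile_cont(0.95).within_group(metric_events.c.value).label("p95"),
        func.percentile_cont(0.99).within_group(metric_events.c.value).label("p99"),
    ).group_by(
        metric_events.c.metric_name,
        metric_events.c.bucket_start,
    )
```

This keeps all percentile math in PostgreSQL and returns one row per metric/bucket, rather than pulling raw samples into Python.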
To prevent high memory usage and potential OOM crashes, aggregation queries now stream results in configurable batches using YIELD_BATCH_SIZE (default: 1000, env-configurable), instead of materializing the full result set in memory. This keeps memory usage predictable and stable under high traffic volumes.
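A minimal illustration of the batched streaming, assuming a SQLAlchemy session with yield_per support; the helper name and query variable here are hypothetical, only YIELD_BATCH_SIZE comes from the PR:

```python
import os

YIELD_BATCH_SIZE = int(os.environ.get("YIELD_BATCH_SIZE", "1000"))

def iter_rollup_rows(session, query):
    """Stream aggregation results in batches instead of materializing them."""
    result = session.execute(
        query.execution_options(yield_per=YIELD_BATCH_SIZE)
    )
    for partition in result.partitions():
        for row in partition:
            yield row
```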
Rollup writes are now fully dialect-aware via _upsert_rollup (see the sketch after this list):
- PostgreSQL and SQLite use native UPSERT semantics.
- Other dialects fall back to a safe Python insert/update flow with explicit race-condition handling.
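A hedged sketch of how such a dialect-aware upsert can be structured with SQLAlchemy's dialect-specific insert constructs; the rollups table, its conflict columns, and the session handling are assumptions for illustration, not the actual _upsert_rollup implementation:

```python
from sqlalchemy.dialects import postgresql, sqlite
from sqlalchemy.exc import IntegrityError

def upsert_rollup_sketch(session, rollups_table, row):
    """row is a dict of column values, keyed on (metric_name, bucket_start)."""
    key_cols = ("metric_name", "bucket_start")
    updates = {k: v for k, v in row.items() if k not in key_cols}
    dialect = session.get_bind().dialect.name

    if dialect in ("postgresql", "sqlite"):
        # Native UPSERT: INSERT ... ON CONFLICT DO UPDATE.
        insert_fn = postgresql.insert if dialect == "postgresql" else sqlite.insert
        stmt = insert_fn(rollups_table).values(**row).on_conflict_do_update(
            index_elements=list(key_cols),
            set_=updates,
        )
        session.execute(stmt)
    else:
        # Fallback: attempt the insert, and if a concurrent writer already
        # created the row, catch the constraint violation and update instead.
        try:
            with session.begin_nested():
                session.execute(rollups_table.insert().values(**row))
        except IntegrityError:
            session.execute(
                rollups_table.update()
                .where(rollups_table.c.metric_name == row["metric_name"])
                .where(rollups_table.c.bucket_start == row["bucket_start"])
                .values(**updates)
            )
```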
These changes prevent duplicate rows, race conditions, memory spikes, and intermittent rollup failures under concurrent load.