Skip to content

Conversation

@Nayana-R-Gowda
Copy link
Collaborator

@Nayana-R-Gowda Nayana-R-Gowda commented Jan 7, 2026

Signed-off-by: NAYANAR [email protected]
closes #1826

This change introduces a batched aggregation path for custom log search windows to avoid per-window recomputation. Instead of scanning log data separately for each window, the implementation computes metrics for all requested windows in a single pass over the full time range. For PostgreSQL, a single SQL query using generate_series and ordered-set aggregates computes per-window percentiles efficiently. For other databases, logs are fetched once per component/operation and bucketed into windows in Python. Existing aggregation semantics and filters are preserved, while significantly reducing CPU usage and database load for large custom ranges.

@Nayana-R-Gowda Nayana-R-Gowda changed the title This change replaces per-window aggregation with a single batch aggregation over the full time range. The new implementation avoids repeated scans of log data, significantly reducing CPU usage for large custom window ranges, while preserving existing aggregation semantics. Functional equivalence was validated against existing data; logs without duration or operation type are correctly ignored as before. Avoid per-window recomputation in log search custom windows #1826 Jan 7, 2026
@crivetimihai
Copy link
Member

crivetimihai commented Jan 7, 2026

Does this close #1826?

@Nayana-R-Gowda
Copy link
Collaborator Author

Yes, but a few test cases are failing, so I’m checking on that.

This change introduces a batched aggregation path for custom log search
windows to avoid per-window recomputation. Instead of scanning log data
separately for each window, the implementation computes metrics for all
requested windows in a single pass over the full time range.

For PostgreSQL, a single SQL query using generate_series and ordered-set
aggregates computes per-window percentiles efficiently. For other
databases, logs are fetched once per component/operation and bucketed
into windows in Python.

Key changes:
- Add aggregate_all_components_batch() method to LogAggregator service
- Modify _aggregate_custom_windows() to collect all windows first and
  delegate to batch method
- Implement proper fallback to per-window aggregation if batch fails
- Include db.rollback() before fallback to handle PostgreSQL failed state
- Filter duration_ms IS NOT NULL in batch SQL JOIN for consistent
  error_count semantics with per-window path
- Filter component/operation_type IS NOT NULL and non-empty in both
  PostgreSQL and Python paths for consistency with per-window behavior
- Add warning log for large ranges in non-PostgreSQL fallback path
- Add warning when sparse window_starts detected (batch generates full range)
- Limit window list to 10000 entries, keeping most recent windows
- Preserve existing aggregation semantics and filters
- Add comprehensive tests for batch aggregation parity

Performance impact:
- Before: N windows × M component/operation pairs = N×M database queries
- After PostgreSQL: 1 SQL query for all windows and pairs
- After SQLite: M queries regardless of window count

Closes #1826

Signed-off-by: Mihai Criveti <[email protected]>
@crivetimihai crivetimihai force-pushed the 1826_Log_Search_Optimization branch from 1467af8 to 888048f Compare January 8, 2026 09:06
@crivetimihai
Copy link
Member

Review and Fixes Applied

Rebased on main and addressed the following issues identified during code review:

High Priority Fixes

  • duration_ms filter in JOIN: Added AND sle.duration_ms IS NOT NULL to PostgreSQL batch JOIN to ensure error_count only includes rows counted in request_count (prevents error_rate > 1)
  • Empty component/operation filter: Added component IS NOT NULL AND component <> '' and same for operation_type in both PostgreSQL and Python paths to match per-window behavior

Medium Priority Fixes

  • db.rollback() before fallback: Added rollback in except block before falling back to per-window aggregation (required for PostgreSQL failed transaction state)
  • Window list truncation: Changed to keep most recent windows when truncating (window_starts[-max_windows:]) instead of oldest, ensuring fresh metrics are always computed

Low Priority Fixes

  • Memory safeguards: Added 10,000 window limit with warning, and warning for ranges > 168 hours in Python fallback
  • Sparse window detection: Added warning when len(window_starts) != expected_window_count to document that batch generates full contiguous range
  • Test assertions: Fixed two weak test assertions that could pass without validating behavior

Tests Added

  • TestAggregateAllComponentsBatch (8 tests): batch disabled, empty windows, PostgreSQL path, Python fallback, error_count consistency, large range warning
  • TestAggregateCustomWindowsFallback (3 tests): fallback on exception with rollback, no fallback on success, fallback when batch method missing

All 34 tests pass.

Copy link
Member

@crivetimihai crivetimihai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebased, updated, tested

@crivetimihai crivetimihai merged commit 65ae9da into main Jan 8, 2026
52 checks passed
@crivetimihai crivetimihai deleted the 1826_Log_Search_Optimization branch January 8, 2026 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[PERFORMANCE]: Avoid per-window recomputation in log search custom windows

3 participants