
feat: add sysinfo metrics #1139

Open · wants to merge 3 commits into base: main

Conversation

@nikhilsinhaparseable (Contributor) commented Jan 29, 2025

Collect CPU and memory usage of the server.
Collect disk usage of the data, staging, and hot-tier volumes.
Add these metrics to the Prometheus metrics.

Export these metrics to the cluster metrics API.
Add the metrics to the pmeta stream.
Add the querier node's sysinfo metrics to pmeta and the cluster metrics API.
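For context, the collection described above can be sketched with the sysinfo crate roughly as below, assuming a sysinfo 0.30-style API (the crate's API shifts between versions); names and output are illustrative, not the PR's exact code.

```rust
use sysinfo::{Disks, System};

fn main() {
    // Take a full system snapshot; CPU usage becomes meaningful after a refresh.
    let mut sys = System::new_all();
    sys.refresh_all();

    // Memory and swap are reported in bytes.
    println!("memory: {} / {} bytes", sys.used_memory(), sys.total_memory());
    println!("swap:   {} / {} bytes", sys.used_swap(), sys.total_swap());

    // Per-core CPU usage as a percentage since the last refresh.
    for cpu in sys.cpus() {
        println!("{}: {:.1}%", cpu.name(), cpu.cpu_usage());
    }

    // Disk usage per mounted volume; the PR maps these to the
    // data, staging, and hot-tier paths.
    let disks = Disks::new_with_refreshed_list();
    for disk in disks.list() {
        println!(
            "{:?}: {} of {} bytes available",
            disk.mount_point(),
            disk.available_space(),
            disk.total_space()
        );
    }
}
```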

Summary by CodeRabbit

  • New Features

    • Introduced enhanced system monitoring with detailed disk, memory, and CPU metrics.
    • Improved metrics initialization across server and cluster components, ensuring more robust observability.
  • Refactor

    • Streamlined internal processing for URL handling, event processing, and log ingestion, resulting in clearer error messaging and consistent performance.

@coveralls

Pull Request Test Coverage Report for Build 13025834726

Details

  • 0 of 390 (0.0%) changed or added relevant lines in 8 files are covered.
  • 4 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 12.761%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
--- | --- | --- | ---
src/handlers/http/modal/ingest_server.rs | 0 | 1 | 0.0%
src/handlers/http/modal/server.rs | 0 | 1 | 0.0%
src/handlers/http/modal/query_server.rs | 0 | 2 | 0.0%
src/correlation.rs | 0 | 3 | 0.0%
src/handlers/http/cluster/mod.rs | 0 | 11 | 0.0%
src/handlers/http/ingest.rs | 0 | 14 | 0.0%
src/metrics/mod.rs | 0 | 174 | 0.0%
src/metrics/prom_utils.rs | 0 | 184 | 0.0%

Files with Coverage Reduction | New Missed Lines | %
--- | --- | ---
src/handlers/http/cluster/mod.rs | 1 | 0.0%
src/metrics/mod.rs | 3 | 0.0%
Totals Coverage Status
Change from base Build 13025432081: -0.2%
Covered Lines: 2477
Relevant Lines: 19411

💛 - Coveralls


coderabbitai bot commented Apr 1, 2025

Walkthrough

This pull request introduces several refactorings and feature enhancements across multiple components. The CLI module now employs a modular URL-building approach with improved error handling and helper methods. The event handling and stream operations drop an unnecessary parameter to simplify data processing. Several HTTP handler modules have been updated to initialize asynchronous system and cluster metrics schedulers with clearer control flows. In addition, the metrics module has been expanded to include comprehensive system, disk, memory, and CPU tracking, alongside enhanced Prometheus sample processing.

Changes

File(s) | Change Summary
--- | ---
src/cli.rs | Refactored get_url to use new helper methods (get_endpoint, parse_endpoint, resolve_env_var, build_url) for improved modularity and error handling.
src/event/mod.rs, src/parseable/streams.rs | Removed the stream_type parameter from method signatures and calls to simplify conditional logic in event processing and record pushing.
src/handlers/http/cluster/mod.rs, src/handlers/http/modal/query_server.rs, src/handlers/http/modal/server.rs, src/handlers/http/modal/ingest_server.rs | Updated metrics scheduler calls: converted functions to asynchronous, fixed naming inconsistencies, and added a new system metrics scheduler initialization.
src/handlers/http/ingest.rs | Simplified ingest_internal_stream by removing schema fetching and event creation logic; it now directly calls flatten_and_push_logs.
src/metrics/mod.rs, src/metrics/prom_utils.rs | Enhanced metrics collection with new static metrics and structures for disk, memory, and CPU tracking; added functions for asynchronous system metrics scheduling and refined Prometheus sample processing with a new MetricType enum and extended Metrics struct.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Server
    participant Handler
    participant Scheduler
    participant Metrics

    Server->>Handler: init() / start server
    Handler->>Scheduler: init_system_metrics_scheduler()
    Scheduler->>Metrics: collect_all_metrics()
    Metrics-->>Scheduler: return metrics data
    Scheduler-->>Handler: scheduled metrics update
```

Suggested labels

for next release

Suggested reviewers

  • de-sh

Poem

I'm a rabbit on the run,
Hopping through the code with fun,
New features bloom beneath the sun,
Metrics and URLs, all finely spun 🐇,
With clean, sharp paths, coding's never done!
Happy hops in every run!


@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (9)
src/parseable/streams.rs (2)

133-155: Consider returning a more descriptive error instead of panicking.

This block is generally correct and well-structured; however, you call .expect("File and RecordBatch both are checked") instead of returning a more descriptive StagingError. In production, this might abruptly terminate the application instead of gracefully propagating the error. As a follow-up, you could replace the .expect(...) with a result-based error handling approach to facilitate better error diagnosis.

```diff
-                let mut writer = DiskWriter::try_new(file_path, &record.schema(), range)
-                    .expect("File and RecordBatch both are checked");
+                let mut writer = DiskWriter::try_new(file_path, &record.schema(), range)
+                    .map_err(|_| StagingError::Create)?;
```

1047-1047: Use a more descriptive .expect(...) in tests.

Using .unwrap() in a test is acceptable, but for clarity, a more descriptive message can help diagnose failures:

```diff
-    staging.push("abc", &batch, time, &HashMap::new()).unwrap();
+    staging.push("abc", &batch, time, &HashMap::new())
+        .expect("Failed to push record batch to staging during test");
```
src/handlers/http/cluster/mod.rs (1)

874-874: Include error handling for querier metrics if necessary.

Adding all_metrics.push(Metrics::querier_prometheus_metrics().await); is a solid enhancement. Consider wrapping it if there's any chance of error or unavailability from querier_prometheus_metrics() so that the entire metrics collection doesn't silently fail if the querier is unreachable.

src/cli.rs (3)

442-451: Provide user-friendly fallback or logs on invalid config.

The logic in get_url is concise, routing between get_endpoint and build_url. However, it uses a panic! in get_endpoint for invalid input. Consider returning a Result<Url, SomeConfigError> or logging more details to help users correct their configurations in production deployments.


468-482: Parsing by splitting on “:” alone may limit IPv6 support.

The code splits on “:” to separate hostname from port, which won’t handle IPv6 addresses gracefully. If you foresee IPv6 usage, consider an established parser or additional checks to handle bracketed IPv6 addresses.
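For illustration, std's SocketAddr already accepts the bracketed IPv6 form that a plain split on ":" mishandles; a standalone sketch, not the PR's code:

```rust
use std::net::SocketAddr;

fn main() {
    // "::1" itself contains colons, so splitting on ':' cannot reliably
    // separate host from port for IPv6 endpoints.
    let raw = "[::1]:8080";

    // SocketAddr understands the bracketed form directly.
    let addr: SocketAddr = raw.parse().expect("valid socket address");
    assert_eq!(addr.port(), 8080);
    assert!(addr.is_ipv6());
}
```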


507-516: Fail-fast parsing with a helpful panic message is appropriate.

build_url clarifies errors for misconfigured addresses. This is fine as a default, but if you anticipate frequently needing dynamic reconfiguration, consider returning a Result<Url, ConfigError> to handle misconfigurations more gracefully at runtime.

src/metrics/prom_utils.rs (2)

65-69: Ensure testing of newly introduced fields
The newly introduced fields for disk, memory, and CPU usage in the Metrics struct are clear and well-defined. Ensure they're fully covered in unit tests, verifying both the default values and how the values change when metrics update.

Also applies to: 98-119


124-131: Consider adding doc comments
The MetricType enum and its from_metric implementation are well-structured. To aid maintainability, consider adding Rust doc comments explaining each variant’s purpose and usage, as these mappings play a crucial role in the metrics pipeline.

Also applies to: 134-174
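A sketch of what those doc comments might look like; the variant set here is invented for illustration, the real one lives in src/metrics/prom_utils.rs:

```rust
/// Classifies a parsed Prometheus sample by metric name so it can be
/// routed into the matching field of `Metrics`.
pub enum MetricType {
    /// Total/used memory and swap gauges.
    MemoryUsage,
    /// Per-core CPU usage percentages.
    CpuUsage,
    /// Total/used/available bytes for a tracked volume.
    DiskUsage,
}

impl MetricType {
    /// Maps a raw metric name to its type; `None` for unrecognized names.
    pub fn from_metric(name: &str) -> Option<Self> {
        match name {
            n if n.contains("memory") => Some(Self::MemoryUsage),
            n if n.contains("cpu") => Some(Self::CpuUsage),
            n if n.contains("disk") => Some(Self::DiskUsage),
            _ => None,
        }
    }
}
```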

src/metrics/mod.rs (1)

195-230: Suggest verifying naming consistency
The newly declared disk, memory, and CPU gauges and counters are logically grouped under the same namespace. As a nitpick, ensure consistent naming conventions (e.g., “_disk” vs. “_usage”) to minimize confusion in dashboards.

Also applies to: 280-294

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 382f480 and 6f96079.

📒 Files selected for processing (10)
  • src/cli.rs (1 hunks)
  • src/event/mod.rs (0 hunks)
  • src/handlers/http/cluster/mod.rs (2 hunks)
  • src/handlers/http/ingest.rs (2 hunks)
  • src/handlers/http/modal/ingest_server.rs (2 hunks)
  • src/handlers/http/modal/query_server.rs (2 hunks)
  • src/handlers/http/modal/server.rs (2 hunks)
  • src/metrics/mod.rs (4 hunks)
  • src/metrics/prom_utils.rs (4 hunks)
  • src/parseable/streams.rs (2 hunks)
💤 Files with no reviewable changes (1)
  • src/event/mod.rs
🧰 Additional context used
🧬 Code Definitions (8)
src/handlers/http/modal/ingest_server.rs (1)
src/metrics/mod.rs (1)
  • init_system_metrics_scheduler (386-406)
src/handlers/http/modal/server.rs (1)
src/metrics/mod.rs (1)
  • init_system_metrics_scheduler (386-406)
src/handlers/http/modal/query_server.rs (2)
src/handlers/http/cluster/mod.rs (1)
  • init_cluster_metrics_scheduler (878-927)
src/metrics/mod.rs (1)
  • init_system_metrics_scheduler (386-406)
src/parseable/streams.rs (2)
src/utils/time.rs (2)
  • granularity_range (267-282)
  • new (59-61)
src/parseable/staging/writer.rs (1)
  • try_new (57-72)
src/cli.rs (5)
src/option.rs (1)
  • mode (127-135)
src/storage/object_storage.rs (1)
  • get_endpoint (74-74)
src/storage/localfs.rs (1)
  • get_endpoint (83-85)
src/storage/azure_blob.rs (1)
  • get_endpoint (192-194)
src/storage/s3.rs (1)
  • get_endpoint (322-324)
src/metrics/prom_utils.rs (1)
src/metrics/mod.rs (2)
  • get_system_metrics (509-542)
  • get_volume_disk_usage (454-480)
src/handlers/http/cluster/mod.rs (2)
src/metrics/mod.rs (1)
  • collect_all_metrics (409-417)
src/metrics/prom_utils.rs (2)
  • querier_prometheus_metrics (203-252)
  • new (79-121)
src/metrics/mod.rs (1)
src/metrics/prom_utils.rs (1)
  • new (79-121)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
🔇 Additional comments (21)
src/handlers/http/modal/ingest_server.rs (2)

31-31: LGTM: Added import for system metrics scheduler

Added import for the new system metrics scheduling functionality that will be used in the init method.


114-114: Added system metrics collection to ingest server

This adds the ability to collect and monitor system metrics (CPU, memory, disk usage) on the ingestor node, which aligns with the PR's objective of enhancing monitoring capabilities. This metrics collection is initialized during server startup after metadata is stored.

The implementation properly follows an asynchronous pattern and handles potential errors by propagating them upward with the ? operator, which ensures the server won't start if metrics initialization fails.

src/handlers/http/modal/server.rs (2)

33-33: LGTM: Added import for system metrics scheduler

Added import for the system metrics scheduling functionality that will be used in the init method.


138-139: Added system metrics collection to the main server

This adds the ability to collect and monitor system metrics (CPU, memory, disk usage) on the server node, which aligns with the PR's objective of enhancing monitoring capabilities. The metrics collection is properly initialized during server startup after analytics initialization.

The implementation follows an asynchronous pattern with proper error handling via the ? operator, ensuring the server won't start if metrics initialization fails. It's appropriately placed after the analytics initialization but before spawning the server tasks.

src/handlers/http/ingest.rs (2)

31-31: Updated import to include LogSourceEntry

Updated import for format-related types to match their usage in the code.


127-127: Simplified internal stream processing

Replaced the previous schema-based event processing with a more streamlined approach using flatten_and_push_logs. This simplification eliminates the need for schema retrieval and additional transformation steps when ingesting internal stream data.

This approach is more consistent with how other stream ingestion is handled throughout the codebase and reduces the complexity of the ingest_internal_stream function. Since this is used for internal pmeta streams which include the system metrics data being added in this PR, the simplified approach should make it easier to reliably capture the metrics.

src/handlers/http/modal/query_server.rs (3)

22-22: Fixed typo in cluster metrics scheduler import

Corrected the function name from "schedular" to "scheduler" in the import statement.


28-28: Added import for system metrics scheduler

Added import for the system metrics scheduling functionality that will be used in the init method.


118-120: Improved metrics initialization sequence

Enhanced the query server initialization by adding system metrics collection and improving the cluster metrics initialization flow. The previous conditional approach has been replaced with a more robust sequential initialization pattern.

This implementation:

  1. Initializes system metrics collection (CPU, memory, disk usage) on the querier node
  2. Initializes cluster metrics collection from all ingestors
  3. Uses proper error handling with the ? operator to ensure the server won't start if either metrics initialization fails

This change aligns perfectly with the PR's objective of adding sysinfo metrics from the querier node to both pmeta and cluster metrics API, enhancing the overall monitoring capabilities of the system.

src/handlers/http/cluster/mod.rs (3)

42-42: Imports look good.

Introducing collect_all_metrics here properly ties together the metric collection logic used later in this file.


878-878: Asynchronous scheduler initialization appears sound.

Converting init_cluster_metrics_scheduler into an async function aligns it well with asynchronous tasks. This helps ensure non-blocking operation when scheduling metrics collection.


885-887: Good error logging addition.

Capturing and logging errors from collect_all_metrics() provides better visibility. Ensure sensitive data is not included in any custom error messages returned from collect_all_metrics() to avoid accidental PII leakage.

src/cli.rs (2)

453-466: Panic-based validation for endpoints is acceptable but strict.

The new get_endpoint function panics if the endpoint string includes “http” or if environment variables are malformed. This is a valid fail-fast strategy, but you might consider a more robust error mechanism if there are advanced use cases (e.g., IPv6 or over-sanitized environment variables).


484-505: Environment variable resolution strategy looks suitable.

Resolving the $VARNAME pattern is convenient. The panic path is again a valid approach if environment variables are strictly required at startup. If you plan to allow optional environment variables or partial expansions, you might expand the logic accordingly.
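A self-contained sketch of the $VARNAME expansion under discussion, returning a Result instead of panicking; the helper name mirrors the PR's, but the body is illustrative:

```rust
use std::env;

/// Resolves a leading `$VARNAME` to the variable's value; plain strings
/// pass through untouched. Illustrative stand-in for the PR's helper.
fn resolve_env_var(value: &str) -> Result<String, String> {
    match value.strip_prefix('$') {
        Some(name) => env::var(name)
            .map_err(|_| format!("environment variable `{name}` is not set")),
        None => Ok(value.to_owned()),
    }
}

fn main() {
    env::set_var("P_ADDR", "0.0.0.0:8000");
    assert_eq!(resolve_env_var("$P_ADDR").unwrap(), "0.0.0.0:8000");
    assert_eq!(resolve_env_var("localhost:8000").unwrap(), "localhost:8000");
}
```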

src/metrics/prom_utils.rs (3)

19-21: Imports look good
No issues observed with introducing HashMap and Path here, as they align well with the new disk and memory usage collection logic.


22-45: No immediate concerns
These added imports from crate::about::current, along with system metrics and disk usage utilities, seem necessary and correctly scoped.


254-361: Increase test coverage for metric processing
The build_metrics_from_samples, process_gauge_metric, and related helper functions correctly process Prometheus samples into Metrics. However, there's a significant gap in test coverage. Implement targeted tests confirming each metric maps correctly (especially disk and memory usage) to ensure reliability.

Would you like me to generate a specialized test module for these functions?
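As a shape sketch only (the real functions' signatures aren't reproduced here), a targeted test would pin down one name-to-bucket mapping at a time:

```rust
#[cfg(test)]
mod tests {
    // Illustrative stand-in for MetricType::from_metric; the real mapping
    // lives in src/metrics/prom_utils.rs.
    fn classify(name: &str) -> &'static str {
        if name.contains("memory") {
            "memory"
        } else if name.contains("cpu") {
            "cpu"
        } else if name.contains("disk") {
            "disk"
        } else {
            "other"
        }
    }

    #[test]
    fn maps_metric_names_to_expected_buckets() {
        assert_eq!(classify("parseable_memory_usage_total"), "memory");
        assert_eq!(classify("parseable_cpu_usage"), "cpu");
        assert_eq!(classify("parseable_staging_disk_usage"), "disk");
        assert_eq!(classify("parseable_events_ingested"), "other");
    }
}
```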

src/metrics/mod.rs (4)

21-29: Imports and constants are fine
No issues with the newly introduced imports and constants for scheduling intervals and HTTP integration.

Also applies to: 34-35, 39-39


358-383: Struct definitions are clear
DiskMetrics, SystemMetrics, MemoryMetrics, and CpuMetrics are well-defined. Using simple numeric fields is efficient for direct usage in Prometheus metrics.


436-451: Robust disk and system metrics collection
The functions for collecting disk usage, memory, and CPU metrics are well-structured and straightforward. Current usage of the sysinfo crate is appropriate. Just ensure you handle large disk sizes and high CPU usage gracefully when parsing into i64 or f64.

Please confirm large disk or CPU usage values are tested to avoid overflow or unexpected truncation issues.

Also applies to: 453-480, 482-506, 508-542
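One way to make the narrowing explicit; a minimal sketch assuming byte counts arrive as u64 from sysinfo:

```rust
// Clamp instead of wrapping or panicking when a u64 byte count has to fit
// into a signed gauge value.
fn to_gauge_value(bytes: u64) -> i64 {
    i64::try_from(bytes).unwrap_or(i64::MAX)
}

fn main() {
    assert_eq!(to_gauge_value(1024), 1024);
    assert_eq!(to_gauge_value(u64::MAX), i64::MAX); // saturates, no overflow
}
```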


385-406:

❓ Verification inconclusive

Potential concurrency consideration
init_system_metrics_scheduler spawns a loop sleeping every 10 seconds but uses an interval of 1 minute. Confirm that the more frequent loop wake-up won’t cause concurrency or performance overhead.

Also applies to: 408-417, 419-434


🏁 Script executed:

```bash
#!/bin/bash
# Check for references to concurrency control or repeated scheduling calls
rg "init_system_metrics_scheduler" -A 10
```

Length of output: 5324


Concurrency Consideration – Verify Scheduler Wake-Up Frequency
In src/metrics/mod.rs, the init_system_metrics_scheduler function spawns an async loop that calls scheduler.run_pending() every 10 seconds, while the metrics collection is scheduled using SYSTEM_METRICS_INTERVAL_SECONDS (typically a one-minute interval). Please confirm that this more frequent wake-up does not introduce any concurrency-related overhead or performance degradation. The same pattern appears in the code blocks at lines 408–417 and 419–434, so verifying that these intervals remain lightweight under load is advisable.
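For reference, the wake-up pattern being questioned looks roughly like this, assuming a clokwerk-style AsyncScheduler inside a tokio runtime; illustrative, not the PR's exact code:

```rust
use std::time::Duration;
use clokwerk::{AsyncScheduler, TimeUnits};

fn spawn_metrics_scheduler() {
    let mut scheduler = AsyncScheduler::new();
    // The job itself fires once per minute...
    scheduler.every(60.seconds()).run(|| async {
        // collect_all_metrics() would run here.
    });
    tokio::spawn(async move {
        loop {
            // ...but the loop wakes every 10 seconds just to check for due
            // jobs; run_pending() is cheap when nothing is due.
            scheduler.run_pending().await;
            tokio::time::sleep(Duration::from_secs(10)).await;
        }
    });
}
```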

Comment on lines +203 to +247
```rust
pub async fn querier_prometheus_metrics() -> Self {
    let mut metrics = Metrics::new(
        PARSEABLE.options.get_url(Mode::Query).to_string(),
        "querier".to_string(),
    );

    let system_metrics = get_system_metrics().expect("Failed to get system metrics");

    metrics.parseable_memory_usage.total = system_metrics.memory.total;
    metrics.parseable_memory_usage.used = system_metrics.memory.used;
    metrics.parseable_memory_usage.total_swap = system_metrics.memory.total_swap;
    metrics.parseable_memory_usage.used_swap = system_metrics.memory.used_swap;
    for cpu_usage in system_metrics.cpu {
        metrics
            .parseable_cpu_usage
            .insert(cpu_usage.name.clone(), cpu_usage.usage);
    }

    let staging_disk_usage = get_volume_disk_usage(PARSEABLE.options.staging_dir())
        .expect("Failed to get staging volume disk usage");

    metrics.parseable_staging_disk_usage.total = staging_disk_usage.total;
    metrics.parseable_staging_disk_usage.used = staging_disk_usage.used;
    metrics.parseable_staging_disk_usage.available = staging_disk_usage.available;

    if PARSEABLE.get_storage_mode_string() == "Local drive" {
        let data_disk_usage =
            get_volume_disk_usage(Path::new(&PARSEABLE.storage().get_endpoint()))
                .expect("Failed to get data volume disk usage");

        metrics.parseable_data_disk_usage.total = data_disk_usage.total;
        metrics.parseable_data_disk_usage.used = data_disk_usage.used;
        metrics.parseable_data_disk_usage.available = data_disk_usage.available;
    }

    if PARSEABLE.options.hot_tier_storage_path.is_some() {
        let hot_tier_disk_usage =
            get_volume_disk_usage(PARSEABLE.hot_tier_dir().as_ref().unwrap())
                .expect("Failed to get hot tier volume disk usage");

        metrics.parseable_hot_tier_disk_usage.total = hot_tier_disk_usage.total;
        metrics.parseable_hot_tier_disk_usage.used = hot_tier_disk_usage.used;
        metrics.parseable_hot_tier_disk_usage.available = hot_tier_disk_usage.available;
    }
```


🛠️ Refactor suggestion

Use more robust error handling
Methods like querier_prometheus_metrics rely on .expect(...) for retrieving system metrics and disk usage. While this is straightforward, it can crash the entire server on errors. Consider gracefully handling failures (e.g., logging a warning and proceeding) to avoid downtime.
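A minimal sketch of that log-and-continue approach, with a stand-in collector since the real signatures aren't reproduced here:

```rust
// Stand-in for a fallible collector such as get_system_metrics().
fn get_system_metrics() -> Result<(u64, u64), String> {
    Ok((16_000_000_000, 8_000_000_000)) // dummy (total, used) bytes
}

fn main() {
    // Degrade gracefully: a failed probe logs a warning and keeps the
    // previous gauge values instead of crashing the server.
    match get_system_metrics() {
        Ok((total, used)) => println!("memory: {used} / {total} bytes"),
        Err(e) => eprintln!("warning: failed to collect system metrics: {e}"),
    }
}
```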
