Skip to content

Conversation

BaldDemian
Copy link

@BaldDemian BaldDemian commented May 14, 2025

Rationale

Close #1577

Detailed Changes

  • In src/server/src/http.rs, register the pprof api, similar to the existing /debug/profile/heap/{seconds}.
  • In src/components/profile/src/lib.rs, implement dump_heap_pprof function, similar to the existing dump_heap_prof.

Test Plan

Manual test. More details are given below.

@github-actions github-actions bot added the feature New feature or request label May 14, 2025
@BaldDemian
Copy link
Author

BaldDemian commented May 14, 2025

how to test

  1. In shell 1, build and run the HoraeDB server with heap profiling enabled:

    cargo build
    export MALLOC_CONF="prof:true,prof_active:false,lg_prof_sample:19"
    ./target/debug/horaedb-server --config docs/minimal.toml
  2. In shell 2, run

    curl localhost:5440/debug/pprof/heap/15 > heap.pb.gz
  3. Install pprof and Graphviz. You can generate report files(txt, svg...) using the pprof command and the heap.pb.gz file. Generally, running pprof commands in machines other than the original HoraeDB machine may see symbolization faiure like this:

    1

    It says there are two missing directories:

    • ./target/debug/horaedb-server, the executable binary Rust code
    • /lib/x86_64-linux-gnu/libc.so, the Linux specific libc code

    The reason is that the original heap.pb.gz file only contains the addresses of the called functios instead of their names. pprof relies on the above two missing directories to map the function addresses to their names.

    It is worth mentioning that the symbolize feature is enabled in the jemalloc_pprof dependency, so the output heap.pb.gz file already contains all Rust functions names.

    In theory, maybe we can run pprof in other machines of the exact the same (otherwise the libc code may be different ) OS version of the HoraeDB machine, without seeing above symbolization faiure.

    But on other OSes, the failure still exists since they don't contain the correct /lib/x86_64-linux-gnu/libc.so file. For example, on my arm64 macOS machine, it shows:

    2

    But you can still see that the Rust funtion names are shown, due to the enabled symbolize feature in the jemalloc_pprof dependency. This feature may slightly hurt the performance, but also brings some convenience when using the pprof command, so I enable it in this commit.

    Above discussion is similar to this comment.

    An alternative is to add an API that returns the flamegraph directly, like this example

@BaldDemian
Copy link
Author

I doubt the necessity of adding these two scripts in integration tests🤔. A simple curl command may be enough to test the api.

@jiacai2050 jiacai2050 requested review from Copilot and jiacai2050 May 25, 2025 12:25
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a new pprof-based heap profiling endpoint to the HTTP server, implements the backend dump logic using jemalloc’s pprof interface, and provides integration scripts to exercise the new API.

  • Register /debug/pprof/heap/{seconds} route in the HTTP layer
  • Implement dump_heap_pprof in the profiling component with tikv-jemalloc crates
  • Add two Python integration scripts for stress testing and requesting the new endpoint

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/server/src/http.rs Register new pprof heap endpoint, map PprofHeap error variant
src/components/profile/src/lib.rs Implement dump_heap_pprof, switch to tikv_jemalloc crates
src/components/profile/Cargo.toml Update dependencies: add jemalloc_pprof, tikv-jemalloc-*
integration_tests/pprof_heap/stress.py Add SQL workload stress test script (not directly pprof-related)
integration_tests/pprof_heap/request.py Add script to fetch and save raw heap pprof data
Comments suppressed due to low confidence (2)

src/server/src/http.rs:106

  • [nitpick] The error message uses 'Fail to do heap pprof', which is awkward; consider rephrasing to 'Failed to dump heap pprof' for clarity and consistent tense.
#[snafu(display("Fail to do heap pprof, err:{}.\nBacktrace:\n{}", source, backtrace))]

src/components/profile/src/lib.rs:187

  • The new dump_heap_pprof method lacks unit tests to validate its behavior; consider adding tests to cover successful pprof dumps and error paths (e.g. uninitialized PROF_CTL).
pub fn dump_heap_pprof(&self, seconds: u64) -> Result<Vec<u8>> {

Comment on lines +641 to +642
warp::path!("debug" / "pprof" / "heap" / ..)
.and(warp::path::param::<u64>())
Copy link
Preview

Copilot AI May 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Using .. in the warp::path! macro combined with a separate param can be error-prone; consider using warp::path!("debug" / "pprof" / "heap" / u64) to bind the segment directly for clearer routing.

Suggested change
warp::path!("debug" / "pprof" / "heap" / ..)
.and(warp::path::param::<u64>())
warp::path!("debug" / "pprof" / "heap" / u64)

Copilot uses AI. Check for mistakes.

let _guard = ProfLockGuard::new(lock_guard)?;

// wait for seconds for collect the profiling data
thread::sleep(time::Duration::from_secs(seconds));
Copy link
Preview

Copilot AI May 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Since Duration is already imported, you can simplify this to thread::sleep(Duration::from_secs(seconds)); for consistency with existing imports.

Suggested change
thread::sleep(time::Duration::from_secs(seconds));
thread::sleep(Duration::from_secs(seconds));

Copilot uses AI. Check for mistakes.

@BaldDemian
Copy link
Author

Saw the failing CI checks. Could be related to the changed cargo.toml or cargo.lock. I will try fix them.

@jiacai2050
Copy link
Contributor

@BaldDemian Please don't update this PR anymore, I will try fix those CI errors.

@BaldDemian
Copy link
Author

@jiacai2050, thanks for your help! I tried fix these errors this morning but didn't succeed

Copy link
Contributor

@jiacai2050 jiacai2050 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jiacai2050 jiacai2050 merged commit f5236d8 into apache:analytic-engine May 26, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants