[tool] feature: scheduling analysis based on profiling data#5248
mayunaise wants to merge 4 commits into verl-project:main
Conversation
Code Review
This pull request introduces a new tool for scheduling analysis based on profiling data. The implementation is comprehensive, including data preprocessing, parsing, visualization, and testing. However, I've identified several critical issues related to code robustness, portability, and correctness that could lead to runtime crashes or unexpected behavior. These include fragile logic that relies on specific file naming conventions, potential null reference exceptions, and incorrect type hint imports. Addressing these issues will significantly improve the tool's reliability and maintainability.
```python
try:
    for (map_key, dir_list) in rank_id_map.items():
        dir_list.sort(key=lambda x: x.split('_')[-3])
        self.data_map[map_key] = dir_list
except Exception as e:
    raise RuntimeError("Found invalid directory name!") from e
```
The sorting key lambda x: x.split('_')[-3] is fragile: it raises an IndexError for any directory name with fewer than three '_'-separated fields. That error is caught by a broad except Exception block which re-raises a generic RuntimeError, crashing the program with an unhelpful message. This should be made more robust so unexpected directory names don't crash the tool.
Suggested change:

```diff
-try:
-    for (map_key, dir_list) in rank_id_map.items():
-        dir_list.sort(key=lambda x: x.split('_')[-3])
-        self.data_map[map_key] = dir_list
-except Exception as e:
-    raise RuntimeError("Found invalid directory name!") from e
+for (map_key, dir_list) in rank_id_map.items():
+    try:
+        dir_list.sort(key=lambda x: x.split('_')[-3])
+    except IndexError:
+        logger.warning(f"Could not sort paths for {map_key} due to unexpected directory name format. The order may be incorrect.")
+    self.data_map[map_key] = dir_list
```
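A stricter alternative, if warn-and-continue feels too permissive, is to validate the expected name shape up front instead of indexing split parts. This is only a sketch: the `<prefix>_<pid>_<timestamp>_ascend_pt` naming convention and the `DIR_PATTERN`/`sort_key` helpers are assumptions for illustration, not part of the PR.

```python
import re

# Assumed naming convention for illustration: <prefix>_<pid>_<timestamp>_ascend_pt,
# in which split('_')[-3] would pick out <timestamp>.
DIR_PATTERN = re.compile(r"^.+_\d+_(?P<ts>\d+)_ascend_pt$")

def sort_key(dir_name: str) -> str:
    # Hypothetical helper: sort by the timestamp field when the name matches,
    # and fall back to the raw name (instead of crashing) when it does not.
    match = DIR_PATTERN.match(dir_name)
    return match.group("ts") if match else dir_name
```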
```python
# Convert to milliseconds
us_to_ms = Constant.US_TO_MS
start_time_ms = start_ids / us_to_ms
duration_ms = (end_ids - start_ids) / us_to_ms
end_time_ms = start_time_ms + duration_ms

event_data = {
    'name': roll,
    "roll": roll,
    'domain': "default",
    'start_time_ms': start_time_ms,
    'end_time_ms': end_time_ms,
    'duration_ms': duration_ms,
    'rank_id': rank_id,
    'tid': row["pid"]
}

events.append(event_data)

return events
```
There are two issues here. First, if no events are found for the process_id, start_ids will remain None, and line 154 will raise a TypeError. Second, if the loop at line 124 finds no matching events, row will be unbound, and line 166 will raise an UnboundLocalError. You should handle the case where no events are found, and use process_id for the tid.
Suggested change:

```python
if start_ids is None:
    return events

# Convert to milliseconds
us_to_ms = Constant.US_TO_MS
start_time_ms = start_ids / us_to_ms
duration_ms = (end_ids - start_ids) / us_to_ms
end_time_ms = start_time_ms + duration_ms

event_data = {
    'name': roll,
    "roll": roll,
    'domain': "default",
    'start_time_ms': start_time_ms,
    'end_time_ms': end_time_ms,
    'duration_ms': duration_ms,
    'rank_id': rank_id,
    'tid': process_id
}

events.append(event_data)
return events
```

```python
t0 = df["Start"].min()
df["Start"] -= t0
df["Finish"] -= t0
df["Duration"] = df["Finish"] - df["Start"]
return df, t0
```
If the dataframe df is empty after the cleaning steps above (e.g., all rows have NaNs or invalid start/finish times), df["Start"].min() yields NaN (or raises, depending on the pandas version and dtype), which then propagates through every derived column. You should handle the case of an empty dataframe gracefully, for example by returning early.
Suggested change:

```diff
+if df.empty:
+    return df, 0.0
 t0 = df["Start"].min()
 df["Start"] -= t0
 df["Finish"] -= t0
 df["Duration"] = df["Finish"] - df["Start"]
 return df, t0
```
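For intuition, here is a quick standalone check of what the normalization does once the empty-frame guard is in place. The sample values are made up:

```python
import pandas as pd

# Made-up sample: two events with absolute start/finish timestamps.
df = pd.DataFrame({"Start": [120.0, 100.0], "Finish": [180.0, 150.0]})

if not df.empty:
    t0 = df["Start"].min()                       # 100.0, the earliest event
    df["Start"] -= t0                            # shifted to [20.0, 0.0]
    df["Finish"] -= t0                           # shifted to [80.0, 50.0]
    df["Duration"] = df["Finish"] - df["Start"]  # [60.0, 50.0]
```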
```python
for dir_name in dirs:
    if dir_name.endswith(Constant.PT_PROF_SUFFIX):
        path = os.path.join(root, dir_name)
        ascend_pt_dirs.append({"roll": os.path.dirname(path).split('/')[-1], "path": path})
```
The use of split('/') to extract the directory name is not portable and will fail on operating systems that use a different path separator, such as Windows. You should use os.path.basename to make this code cross-platform.
Suggested change:

```diff
-ascend_pt_dirs.append({"roll": os.path.dirname(path).split('/')[-1], "path": path})
+ascend_pt_dirs.append({"roll": os.path.basename(os.path.dirname(path)), "path": path})
```
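A small self-contained illustration of why the suggestion matters; the path components are hypothetical:

```python
import os

# Hypothetical layout: <role dir>/<profiler output dir>.
path = os.path.join("logs", "rollout_rank0", "worker_ascend_pt")
parent = os.path.dirname(path)

print(os.path.basename(parent))  # 'rollout_rank0' on every platform
print(parent.split("/")[-1])     # 'rollout_rank0' on POSIX, but the whole
                                 # 'logs\\rollout_rank0' string on Windows,
                                 # because join used '\\' as the separator there
```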
```python
ClusterParserFn = Callable[
    [
        str,
        str,
        Dict
    ],
    pd.DataFrame,
]
```
The type hint ClusterParserFn is defined as Callable[[str, str, Dict], pd.DataFrame], but the registered functions like cluster_parser_mstx have the signature (config: Dict) -> pd.DataFrame. The type hint should match the actual function signature to avoid confusion and potential issues with static analysis tools.
Suggested change:

```python
ClusterParserFn = Callable[[Dict], pd.DataFrame]
```
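To see why the corrected alias is the one that type-checks, here is a minimal sketch of a parser registry built around it. The `register_cluster_parser` decorator and the body of `cluster_parser_mstx` are assumptions for illustration; only the `(config: Dict) -> pd.DataFrame` signature comes from the review comment.

```python
from typing import Callable, Dict

import pandas as pd

ClusterParserFn = Callable[[Dict], pd.DataFrame]
CLUSTER_PARSER_REGISTRY: Dict[str, ClusterParserFn] = {}

def register_cluster_parser(name: str):
    # Hypothetical registration decorator; with the corrected alias, static
    # analysis can verify every registered parser takes a config dict and
    # returns a DataFrame.
    def decorator(fn: ClusterParserFn) -> ClusterParserFn:
        CLUSTER_PARSER_REGISTRY[name] = fn
        return fn
    return decorator

@register_cluster_parser("mstx")
def cluster_parser_mstx(config: Dict) -> pd.DataFrame:
    # Placeholder body; the real parser reads profiling output per `config`.
    return pd.DataFrame()
```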
```python
if dir_name.endswith(Constant.PT_PROF_SUFFIX):
    path = os.path.join(root, dir_name)
    ascend_pt_dirs.append({"roll": os.path.dirname(path).split("/")[-1], "path": path})

data_processor = DataPreprocessor(ascend_pt_dirs)
```
The code here needs to be generic, not Ascend-specific.
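One hypothetical way to address this is to parameterize the discovery walk by suffix, so nothing Ascend-specific is hard-coded. The helper name and signature below are illustrative, not from the PR:

```python
import os

def collect_profiler_dirs(input_path: str, suffix: str) -> list[dict]:
    # Hypothetical backend-agnostic variant: the caller supplies the suffix
    # (looked up per profiler type) instead of hard-coding Constant.PT_PROF_SUFFIX.
    found = []
    for root, dirs, _files in os.walk(input_path):
        for dir_name in dirs:
            if dir_name.endswith(suffix):
                path = os.path.join(root, dir_name)
                found.append({"roll": os.path.basename(os.path.dirname(path)), "path": path})
    return found
```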
```python
def main():
    arg_parser = argparse.ArgumentParser(description="集群调度可视化")
```
Translate the comments and help information into English.
```python
def __init__(self, path_list: list[dict]) -> None:
    self.path_list = path_list
    self.data_map = {}
    pass
```
```python
rank_id_map[(task_roll, rank_id)].append(dir_name)
try:
    for map_key, dir_list in rank_id_map.items():
        dir_list.sort(key=lambda x: x.split("_")[-3])
```
```python
class DataMap(TypedDict):
    rank_id: int
    roll: str
```
What's the meaning of `roll` here?
```python
def main():
    arg_parser = argparse.ArgumentParser(description="Cluster scheduling visualization")
    arg_parser.add_argument("--input-path", default="test", help="Raw path of profiling data")
    arg_parser.add_argument("--profiler-type", default="mstx", help="Profiling data type")
```
Add the full set of supported profiler types to the help text, i.e. CLUSTER_PARSER_REGISTRY.keys().
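A minimal sketch of what that could look like. The stand-in registry content is assumed so the snippet runs on its own, and the `choices` keyword is an extra hardening step beyond what the comment asks for:

```python
import argparse

# Stand-in: assumes CLUSTER_PARSER_REGISTRY maps type name -> parser fn,
# as elsewhere in the PR.
CLUSTER_PARSER_REGISTRY = {"mstx": None}

supported = sorted(CLUSTER_PARSER_REGISTRY.keys())
arg_parser = argparse.ArgumentParser(description="Cluster scheduling visualization")
arg_parser.add_argument(
    "--profiler-type",
    default="mstx",
    choices=supported,  # reject unknown types up front
    help=f"Profiler type, one of: {', '.join(supported)}",
)
```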
```python
def main():
    arg_parser = argparse.ArgumentParser(description="Cluster scheduling visualization")
    arg_parser.add_argument("--input-path", default="test", help="Raw path of profiling data")
    arg_parser.add_argument("--profiler-type", default="mstx", help="Profiler type")
```
Add descriptions of the currently supported options to the help information.
```python
super().__init__(params)

# TODO: Future support for parsing with MSTX events
def _parse_rl_mstx_event(self, profiler_data_path: str, rank_id: int, role: str) -> list[EventRow]:
```
The function is not being called. What is its purpose, and is it necessary?
What does this PR do?
Summary
This PR introduces Cluster Analyse, a visualization tool that transforms raw profiling metrics into actionable timeline views, simplifying the task of debugging and optimizing distributed RL workloads within the VeRL framework.
From Raw Logs to Instant Insights
Manual analysis of discrete profiling data is time-consuming and error-prone. This tool automates the parsing of profiling logs, allowing developers to visualize the entire RL pipeline (Rollout, Evaluation, Training) in a cohesive temporal view.
Democratizing Performance Optimization
By visualizing stage-by-stage latency, this tool helps researchers pinpoint issues without deep-diving into raw JSON traces.
Strengthening the VeRL Ecosystem
As RL scales to larger clusters, observability becomes a first-class requirement. This feature fills a critical gap in the VeRL toolchain, providing a standardized way for the community to share and compare performance profiles.
Sample
Checklist Before Starting
- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - For breaking changes, prepend `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`.

Test
API and Usage Example
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If your PR modifies the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.