docs: add finalized InfiniMetrics Dashboard user guide (Chinese) #32
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,147 @@ | ||
| # InfiniMetrics Dashboard User Guide | ||
| ## 1. Dashboard Overview | ||
|
|
||
| InfiniMetrics Dashboard provides a unified interface to visualize benchmark and evaluation results of AI accelerators across the following scenarios: | ||
|
|
||
| - Communication (NCCL / Collective Communication) | ||
|
|
||
| - Training (Training / Distributed Training) | ||
|
|
||
| - Inference (Direct / Service Inference) | ||
|
|
||
| - Operator (Core Operator Performance) | ||
|
|
||
| The benchmark framework produces two types of outputs: | ||
|
|
||
| ``` | ||
| JSON -> configuration / environment / scalar metrics | ||
| CSV -> curves / time-series data | ||
| ``` | ||
| The Dashboard automatically loads test results and provides unified analysis capabilities, including: | ||
|
|
||
| - Run ID fuzzy search: locate specific test runs using partial Run IDs | ||
|
|
||
| - General filters: filter results by framework, model, device count, etc. | ||
|
|
||
| - Multi-run comparison: select multiple runs to compare performance | ||
|
|
||
| - Performance visualization: display curves such as latency / throughput / loss | ||
|
|
||
| - Statistics and configuration view: inspect throughput statistics, runtime configuration, and environment details | ||
|
|
||
| For example, you can enter: | ||
| ``` | ||
| allreduce | ||
| service | ||
| ``` | ||
| to perform fuzzy matching on Run IDs. | ||
|
|
||
| Example screenshot: | ||
|  | ||
| ## 2. Running the Dashboard | ||
| ### 2.1 Environment Requirements | ||
|
|
||
| Before using the Dashboard, install the following dependencies: | ||
| ``` | ||
| streamlit | ||
| plotly | ||
| pandas | ||
| ``` | ||
| ### 2.2 Start the Dashboard | ||
|
|
||
| Run the following command in the project root directory: | ||
| ``` | ||
| python -m streamlit run dashboard/app.py | ||
| ``` | ||
| After a successful start, the following access URLs are shown: | ||
| ``` | ||
| Local URL: http://localhost:8501 | ||
| Network URL: http://<server-ip>:8501 | ||
| ``` | ||
| Explanation: | ||
|
|
||
| Local URL: accessible only on the local machine | ||
|
|
||
| Network URL: accessible from other machines within the same network | ||
|
|
||
| ## 3. Communication Test Analysis | ||
|
|
||
| Path: | ||
| ``` | ||
| Dashboard → Communication Performance Test | ||
| ``` | ||
| Supported features: | ||
| ``` | ||
| Bandwidth analysis curve - peak bandwidth | ||
|
|
||
| Latency analysis curve - average latency | ||
|
|
||
| Test duration | ||
|
|
||
| GPU memory usage | ||
|
|
||
| Communication configuration analysis | ||
| ``` | ||
| Example screenshot: | ||
|  | ||
| ## 4. Inference Test Analysis | ||
|
|
||
| Path: | ||
| ``` | ||
| Dashboard → Inference Performance Test | ||
| ``` | ||
| Modes: | ||
| ``` | ||
| Direct Inference | ||
| Service Inference | ||
| ``` | ||
| Displayed metrics: | ||
| ``` | ||
| TTFT | ||
|
|
||
| Latency | ||
|
|
||
| Throughput | ||
|
|
||
| GPU memory usage | ||
|
|
||
| Inference configuration analysis | ||
| ``` | ||
| Example screenshot: | ||
|  | ||
| ## 5. Training Test Analysis | ||
|
|
||
| Path: | ||
| ``` | ||
| Dashboard → Training Performance Test | ||
| ``` | ||
| Supported features: | ||
| ``` | ||
| Loss curve | ||
|
|
||
| Perplexity curve | ||
|
|
||
| Throughput curve | ||
|
|
||
| GPU memory usage | ||
|
|
||
| Training configuration analysis | ||
| ``` | ||
| Example screenshot: | ||
|  | ||
| ## 6. Operator Test Analysis | ||
|
|
||
| Path: | ||
| ``` | ||
| Dashboard → Operator Performance Test | ||
| ``` | ||
| Supported metrics: | ||
| ``` | ||
| latency | ||
|
|
||
| flops | ||
|
|
||
| bandwidth | ||
| ``` | ||
| Example screenshot: | ||
|  |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,157 @@ | ||
| # InfiniMetrics Dashboard User Guide | ||
|
|
||
| ## 1. Dashboard Overview | ||
|
|
||
| InfiniMetrics Dashboard provides a unified view of benchmark and evaluation results for AI accelerators in the following scenarios: | ||
|
|
||
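| - Communication (NCCL / collective communication) | ||
| - Training (Training / distributed training) | ||
| - Inference (Direct / Service inference) | ||
| - Operator (core operator performance) | ||
|
|
||
| The benchmark framework produces two types of output: | ||
| ``` | ||
| JSON -> configuration / environment / scalar metrics | ||
| CSV -> curves / time-series data | ||
| ``` | ||
| The Dashboard automatically loads test results and provides unified analysis features, including: | ||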
|
|
||
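| - Run ID fuzzy search: quickly locate test runs using partial Run IDs | ||
|
|
||
| - General filters: filter by framework, model, device count, and other criteria | ||
|
|
||
| - Multi-run comparison: select several test runs at once to compare performance | ||
|
|
||
| - Performance visualization: display performance curves such as latency / throughput / loss | ||
|
|
||
| - Statistics and configuration view: inspect throughput statistics, run configuration, and environment information | ||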
|
|
||
| For example, you can enter: | ||
| ``` | ||
| allreduce | ||
| service | ||
| ``` | ||
| to run a fuzzy match against Run IDs. | ||
|
|
||
| Example screenshot: | ||
|  | ||
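| ## 2. Running the Dashboard | ||
| ### 2.1 Environment Requirements | ||
| Before using the Dashboard, install the following dependencies: | ||
| ``` | ||
| streamlit | ||
| plotly | ||
| pandas | ||
| ``` | ||
| ### 2.2 Start the Dashboard | ||
| Run the following command in the project root directory: | ||
| ``` | ||
| python -m streamlit run dashboard/app.py | ||
| ``` | ||
| After a successful start, the following access URLs are shown: | ||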
| ``` | ||
| Local URL: http://localhost:8501 | ||
| Network URL: http://<server-ip>:8501 | ||
| ``` | ||
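| Explanation: | ||
|
|
||
| Local URL: accessible only from the local machine | ||
|
|
||
| Network URL: accessible from other machines on the same network | ||
|
|
||
| ## 3. Communication Test Analysis | ||
| Path: | ||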
|
|
||
| ``` | ||
| Dashboard → Communication Performance Test | ||
| ``` | ||
|
|
||
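| Supported features: | ||
| ``` | ||
| Bandwidth analysis curve - peak bandwidth | ||
|
|
||
| Latency analysis curve - average latency | ||
|
|
||
| Test duration | ||
|
|
||
| GPU memory usage | ||
|
|
||
| Communication configuration analysis | ||
| ``` | ||
|
|
||
| Example screenshot: | ||
|
|
||
|  | ||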
| ## 4. Inference Test Analysis | ||
|
|
||
| Path: | ||
|
|
||
| ``` | ||
| Dashboard → Inference Performance Test | ||
| ``` | ||
|
|
||
| Modes: | ||
| ``` | ||
| Direct Inference | ||
| Service Inference | ||
| ``` | ||
| Displayed metrics: | ||
| ``` | ||
| TTFT | ||
|
|
||
| Latency | ||
|
|
||
| Throughput | ||
|
|
||
| GPU memory usage | ||
|
|
||
| Inference configuration analysis | ||
| ``` | ||
| Example screenshot: | ||
|
|
||
|  | ||
|
|
||
| ## 5. Training Test Analysis | ||
| Path: | ||
|
|
||
| ``` | ||
| Dashboard → Training Performance Test | ||
| ``` | ||
|
|
||
| Supported features: | ||
| ``` | ||
| Loss curve | ||
|
|
||
| Perplexity curve | ||
|
|
||
| Throughput curve | ||
|
|
||
| GPU memory usage | ||
|
|
||
| Training configuration analysis | ||
| ``` | ||
| Example screenshot: | ||
|
|
||
|  | ||
|
|
||
| ## 6. Operator Test Analysis | ||
|
|
||
| Path: | ||
|
|
||
| ``` | ||
| Dashboard → Operator Performance Test | ||
| ``` | ||
|
|
||
| Supported metrics: | ||
| ``` | ||
| latency | ||
|
|
||
| flops | ||
|
|
||
| bandwidth | ||
| ``` | ||
|
|
||
| Example screenshot: | ||
|
|
||
|  | ||
|
|
||
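Section 1 of the guide says the benchmark framework writes JSON for configuration, environment, and scalar metrics, and CSV for curves and time-series data. The sketch below shows one way such a pair could be read into memory using only the Python standard library. The field names (`run_id`, `config`, `metrics`, `step`, `loss`) and the helper `load_run` are assumptions made for illustration; the guide does not define the actual schema or how the Dashboard loads files.

```python
import csv
import io
import json


def load_run(json_text, csv_text):
    """Parse one run: JSON text for scalar data, CSV text for curve data.

    Returns (meta, curve), where `curve` maps each CSV column name to a
    list of floats. Assumes every CSV cell is numeric.
    """
    meta = json.loads(json_text)
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return meta, {}
    curve = {name: [float(row[name]) for row in rows] for name in rows[0]}
    return meta, curve


# Hypothetical file contents; the real schema is not given in the guide.
sample_json = (
    '{"run_id": "train.demo.0001", '
    '"config": {"device_count": 8}, '
    '"metrics": {"avg_throughput": 1234.5}}'
)
sample_csv = "step,loss\n1,2.5\n2,2.1\n3,1.8\n"

meta, curve = load_run(sample_json, sample_csv)
print(meta["run_id"], curve["loss"])
```

Keeping the scalar data and the curve data in separate structures mirrors the JSON/CSV split in the guide. In a real deployment `pandas.read_csv` would be a natural replacement for the `csv` module, since pandas is already a listed dependency.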