Merged
150 changes: 150 additions & 0 deletions docs/dashboard.md
@@ -0,0 +1,150 @@
# InfiniMetrics Dashboard User Guide
## 1. Dashboard Overview

InfiniMetrics Dashboard provides a unified interface to visualize benchmark and evaluation results of AI accelerators across the following scenarios:

- Communication (NCCL / Collective Communication)

- Training (Training / Distributed Training)

- Inference (Direct / Service Inference)

- Operator (Core Operator Performance)

The benchmark framework produces two types of outputs:

```
JSON -> configuration / environment / scalar metrics
CSV -> curves / time-series data
```
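As a rough illustration of how the two output types differ, the sketch below parses one JSON summary and one CSV curve using only the Python standard library. The field names (`run_id`, `device_count`, `peak_bandwidth_gbps`) and column names (`size_bytes`, `bandwidth_gbps`) are hypothetical placeholders, not the framework's actual schema.

```python
import csv
import io
import json

# Hypothetical file contents; the real field and column names may differ.
SUMMARY_JSON = '{"run_id": "allreduce_demo", "device_count": 8, "peak_bandwidth_gbps": 182.5}'
CURVE_CSV = "size_bytes,bandwidth_gbps\n1024,1.5\n1048576,120.0\n"


def load_summary(text):
    """Parse the JSON part: configuration, environment, and scalar metrics."""
    return json.loads(text)


def load_curve(text):
    """Parse the CSV part: one dict of floats per data point of the curve."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: float(value) for key, value in row.items()} for row in reader]


summary = load_summary(SUMMARY_JSON)
curve = load_curve(CURVE_CSV)
print(summary["run_id"], "has", len(curve), "curve points")
```

The split keeps per-run scalars cheap to filter on, while the larger curve data only needs to be read when a run is actually plotted.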
The Dashboard automatically loads test results and provides unified analysis capabilities, including:

- Run ID fuzzy search: locate specific test runs using partial Run IDs

- General filters: filter results by framework, model, device count, etc.

- Multi-run comparison: select multiple runs to compare performance

- Performance visualization: display curves such as latency / throughput / loss

- Statistics and configuration view: inspect throughput statistics, runtime configuration, and environment details

For example, you can enter:
```
allreduce
service
```
to perform fuzzy matching on Run IDs.

Example screenshot:
![Run ID search](./images/runid_research.jpg)
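One plausible reading of the Run ID search is a case-insensitive substring match, sketched below. The Dashboard's actual matching rule is not described in this guide, and the Run IDs used here are invented for illustration.

```python
def search_run_ids(run_ids, query):
    """Return the Run IDs that contain `query`, ignoring case.

    This is an assumed matching rule, not the Dashboard's confirmed behavior.
    An empty query matches every run.
    """
    needle = query.strip().lower()
    if not needle:
        return list(run_ids)
    return [run_id for run_id in run_ids if needle in run_id.lower()]


# Hypothetical Run IDs, made up for this example.
RUN_IDS = [
    "comm.allreduce.8gpu.001",
    "infer.service.llm.002",
    "infer.direct.llm.003",
]

print(search_run_ids(RUN_IDS, "allreduce"))
print(search_run_ids(RUN_IDS, "service"))
```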
## 2. Running the Dashboard
### 2.1 Environment Requirements

Before using the Dashboard, install the following dependencies:
```
streamlit
plotly
pandas
```
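A quick way to confirm that these packages are importable before launching is sketched below. It uses only `importlib` from the standard library; the helper name `missing_modules` is introduced here for illustration.

```python
import importlib.util

# The three dependencies listed above.
REQUIRED = ("streamlit", "plotly", "pandas")


def missing_modules(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dashboard dependencies are importable.")
```

Any names it reports can then be installed with your usual package manager.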
### 2.2 Start the Dashboard

Run the following command in the project root directory:
```
python -m streamlit run dashboard/app.py
```
After startup, Streamlit prints the access URLs:
```
Local URL: http://localhost:8501
Network URL: http://<server-ip>:8501
```
Explanation:

- Local URL: accessible only from the local machine

- Network URL: accessible from other machines on the same network
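If the Network URL does not open from another machine, a simple TCP check can tell whether the port is reachable at all. The sketch below assumes the port 8501 shown above; the helper name `is_reachable` is introduced here for illustration.

```python
import socket


def is_reachable(host, port=8501, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Replace "localhost" with the server IP to test the Network URL from another machine.
print("Dashboard reachable:", is_reachable("localhost"))
```

A `False` result from a remote machine, combined with `True` on the server itself, usually points to a firewall or network-binding issue rather than a Dashboard problem.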

## 3. Communication Test Analysis

Path:
```
Dashboard → Communication Performance Test
```
Supported features:
```
Bandwidth analysis curve - peak bandwidth

Latency analysis curve - average latency

Test duration

GPU memory usage

Communication configuration analysis
```
Example screenshot:
![Communication Test](./images/dashboard_communication.jpg)
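The two headline numbers on this page can be understood as simple reductions over the curve data: the peak is the maximum bandwidth sample and the average latency is the arithmetic mean of the latency samples. The sketch below shows this under assumed column names (`size_bytes`, `bandwidth_gbps`, `latency_us`) and invented sample values; the Dashboard's real CSV layout may differ.

```python
import csv
import io

# Hypothetical communication curve; column names and values are illustrative only.
CSV_TEXT = """size_bytes,bandwidth_gbps,latency_us
1024,0.8,12.0
65536,24.0,20.0
1048576,96.0,40.0
"""


def summarize(text):
    """Reduce a bandwidth/latency curve to peak bandwidth and mean latency."""
    rows = list(csv.DictReader(io.StringIO(text)))
    bandwidths = [float(row["bandwidth_gbps"]) for row in rows]
    latencies = [float(row["latency_us"]) for row in rows]
    return {
        "peak_bandwidth_gbps": max(bandwidths),
        "avg_latency_us": sum(latencies) / len(latencies),
    }


print(summarize(CSV_TEXT))
```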

## 4. Inference Test Analysis

Path:
```
Dashboard → Inference Performance Test
```
Modes:
```
Direct Inference
Service Inference
```
Displayed metrics:
```
TTFT

Latency

Throughput

GPU memory usage

Inference configuration analysis
```
Example screenshot:
![Inference Test](./images/dashboard_inference.jpg)
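For readers unfamiliar with these metrics, the sketch below derives them from per-token timestamps using common definitions: TTFT is the time from request start to the first generated token, latency is the time to the last token, and throughput is tokens per second over the whole request. Whether InfiniMetrics computes them exactly this way is an assumption, and the function name is hypothetical.

```python
def inference_metrics(request_start, token_times):
    """Compute TTFT, end-to-end latency, and throughput for one request.

    `request_start` and `token_times` are timestamps in seconds;
    `token_times` holds the arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    latency = token_times[-1] - request_start
    throughput = len(token_times) / latency
    return {
        "ttft_s": ttft,
        "latency_s": latency,
        "throughput_tok_per_s": throughput,
    }


# Four tokens arriving 0.25 s apart after a request issued at t = 10.0 s.
metrics = inference_metrics(10.0, [10.25, 10.5, 10.75, 11.0])
print(metrics)
```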

## 5. Training Test Analysis

Path:
```
Dashboard → Training Performance Test
```
Supported features:
```
Loss curve

Perplexity curve

Throughput curve

GPU memory usage

Training configuration analysis
```
Example screenshot:
![Training Test](./images/dashboard_training.jpg)
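The loss and perplexity curves are closely related: for a cross-entropy loss measured in nats, perplexity is conventionally `exp(loss)`. The sketch below applies that standard relationship to a few made-up loss values; it is an assumption that the Dashboard derives its perplexity curve the same way.

```python
import math


def perplexity(loss_nats):
    """Convert a cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)


# Made-up loss values for three training steps.
losses = [2.0, 1.0, 0.0]
perplexities = [perplexity(loss) for loss in losses]
print(perplexities)
```

A loss of 0 corresponds to a perplexity of 1, the lowest possible value, which is a handy sanity check when reading the two curves side by side.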

## 6. Operator Test Analysis

Path:
```
Dashboard → Operator Performance Test
```
Supported metrics:
```
latency

flops

bandwidth
```
Example screenshot:
![Operator Test](./images/dashboard_operators.jpg)
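The three operator metrics are linked through the measured latency. As a sketch, the example below derives achieved FLOPS and memory bandwidth for a matrix multiplication, assuming the usual `2 * M * N * K` operation count and that each matrix is read or written exactly once. The function name, the 2-byte element size, and the traffic model are illustrative assumptions, not the benchmark's confirmed formulas.

```python
def matmul_metrics(m, n, k, latency_s, bytes_per_element=2):
    """Estimate achieved TFLOPS and GB/s for an (m x k) @ (k x n) matmul.

    Assumes 2*m*n*k floating-point operations and one pass over
    both inputs and the output (an idealized memory-traffic model).
    """
    flop = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return {
        "tflops": flop / latency_s / 1e12,
        "bandwidth_gb_per_s": bytes_moved / latency_s / 1e9,
    }


# A 1024 x 1024 x 1024 matmul that took 1 ms.
result = matmul_metrics(1024, 1024, 1024, latency_s=1e-3)
print(result)
```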
Binary file added docs/images/dashboard_communication.jpg
Binary file added docs/images/dashboard_inference.jpg
Binary file added docs/images/dashboard_operators.jpg
Binary file added docs/images/dashboard_training.jpg
Binary file added docs/images/runid_research.jpg
157 changes: 157 additions & 0 deletions docs/zh/dashboard.md
@@ -0,0 +1,157 @@
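# InfiniMetrics Dashboard User Guide

## 1. Dashboard Overview

InfiniMetrics Dashboard provides a unified view of test and evaluation results for AI accelerators in the following scenarios:

- Communication (NCCL / collective communication)
- Training (Training / distributed training)
- Inference (Direct / Service inference)
- Operator (core operator performance)

The test framework outputs two types of data:
```
JSON -> configuration / environment / scalar metrics
CSV -> curves / time-series data
```
The Dashboard automatically loads test results and provides unified analysis features, including:

- Run ID fuzzy search: quickly locate test runs using a partial Run ID

- General filters: filter by framework, model, device count, and other conditions

- Multi-run comparison: select multiple test runs at once to compare performance

- Performance visualization: display performance curves such as latency / throughput / loss

- Statistics and configuration view: inspect throughput statistics, run configuration, and environment information

For example, you can enter:
```
allreduce
service
```
to run a fuzzy-match search on Run IDs.

Example screenshot:

![Run ID search](../images/runid_research.jpg)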
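## 2. Running the Dashboard
### 2.1 Environment Requirements
Before using the Dashboard, install the following dependencies:
```
streamlit
plotly
pandas
```
### 2.2 Start the Dashboard
Run the following command in the project root directory:
```
python -m streamlit run dashboard/app.py
```
After a successful start, the access URLs are displayed:
```
Local URL: http://localhost:8501
Network URL: http://<server-ip>:8501
```
Explanation:

Local URL: accessible only from the local machine

Network URL: accessible from other machines on the same network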

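## 3. Communication Test Analysis
Path:

```
Dashboard → Communication Performance Test
```

Supported features:
```
Bandwidth analysis curve - peak bandwidth

Latency analysis curve - average latency

Test duration

GPU memory usage

Communication configuration analysis
```

Example screenshot:

![Communication Test](../images/dashboard_communication.jpg)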
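## 4. Inference Test Analysis

Path:

```
Dashboard → Inference Performance Test
```

Modes:
```
Direct Inference
Service Inference
```
Displayed metrics:
```
TTFT

Latency

Throughput

GPU memory usage

Inference configuration analysis
```
Example screenshot:

![Inference Test](../images/dashboard_inference.jpg)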

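## 5. Training Test Analysis
Path:

```
Dashboard → Training Performance Test
```

Supported features:
```
Loss curve

Perplexity curve

Throughput curve

GPU memory usage

Training configuration analysis
```
Example screenshot:

![Training Test](../images/dashboard_training.jpg)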

## 6. Operator Test Analysis

Path:

```
Dashboard → Operator Performance Test
```

Supported metrics:
```
latency

flops

bandwidth
```

Example screenshot:

![Operator Test](../images/dashboard_operators.jpg)