Open

Quant #1794

139 commits
93482c2
20250926
Sep 6, 2025
d4154c6
delete conflict
Sep 6, 2025
203f32a
update script
Charles2530 Sep 7, 2025
5a314b3
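Refactor script folder: complete quantization training script system
Charles2530 Sep 7, 2025
5946766
Add .gitignore to ignore backup files
Charles2530 Sep 7, 2025
1660168
feat: Refactor training script system; add dynamic quantization type switching and logging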
Charles2530 Sep 7, 2025
d8b153b
update
Charles2530 Sep 8, 2025
807cf85
feat: Implement improved tensor saving and naming system
Charles2530 Sep 8, 2025
fc22485
feat: Unify tensor saving and enhance visualization tools
Charles2530 Sep 9, 2025
665e8b9
Translate visualization scripts from Chinese to English
Charles2530 Sep 9, 2025
47d7579
update env
Charles2530 Sep 10, 2025
146f865
update
Charles2530 Sep 10, 2025
37553f6
update
Charles2530 Sep 10, 2025
e074ca4
Enhanced tensor collection and analysis system
Charles2530 Sep 12, 2025
88b2568
Add --control-iter parameter for tensor collection control
Charles2530 Sep 14, 2025
a82a2c4
Fix tensor saver iteration control
Charles2530 Sep 14, 2025
af2ba84
Add early exit when control_iter is reached
Charles2530 Sep 14, 2025
e291fd0
Fix iteration detection logic in tensor collection script
Charles2530 Sep 14, 2025
8dde48b
Fix sample idx tracking for proper tensor collection
Charles2530 Sep 14, 2025
6927d71
Fix control_iter exit logic timing
Charles2530 Sep 14, 2025
512f04b
Complete parameter consistency check and verification
Charles2530 Sep 14, 2025
2c329f5
Fix tensor collection issues: layer_idx and sample count
Charles2530 Sep 14, 2025
5605510
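Fix control_iter logic: ensure a full iteration is executed
Charles2530 Sep 14, 2025
a6cea5c
Change iteration exit logic: exit after running 1 iteration
Charles2530 Sep 14, 2025
5cc4855
Fix logic in run_wikipedia_tensor_collection.sh
Charles2530 Sep 14, 2025
77a4915
Simplify script logic: let control_iter control when training ends
Charles2530 Sep 14, 2025
4808c21
Clean up temporary test files
Charles2530 Sep 14, 2025
78ba9ba
Add multi-threaded visualization and overflow detection analysis
Charles2530 Sep 14, 2025
b2ae407
Clean up test files
Charles2530 Sep 14, 2025
2b13b12
Fix mxfp4 quantization type definition: FP4-E2M1 format
Charles2530 Sep 14, 2025
53f0942
Clean up test files
Charles2530 Sep 14, 2025
903dbdc
Add test tensor generator and fix overflow detection analyzer
Charles2530 Sep 14, 2025
41bfc2f
Add tensor file diagnosis and repair tools
Charles2530 Sep 14, 2025
b53dfe4
Replace Chinese with English in visualization scripts
Charles2530 Sep 14, 2025
9830253
Partially replace Chinese with English in overflow_detection_analyzer.py
Charles2530 Sep 14, 2025
8130005
Replace Chinese with English in shell scripts
Charles2530 Sep 14, 2025
1ace897
Add improved overflow detection analyzer, fix tensor loading issues, and add tqdm progress bars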
Charles2530 Sep 14, 2025
653e36f
feat: Enhanced multi-threaded tensor visualization with HiFP8 analysis
Charles2530 Sep 14, 2025
8d77c2e
feat: Add layer distribution analysis tool
Charles2530 Sep 14, 2025
aefe644
fix: Add 'weights' tensor type to attention tensor list
Charles2530 Sep 14, 2025
a7f9cd6
feat: Add tqdm progress bars to layer distribution analysis tool
Charles2530 Sep 14, 2025
91bdcd6
fix: Resolve progress bar hanging issue in tensor loading
Charles2530 Sep 14, 2025
ebe16ab
feat: Add support for large tensor files (>500MB)
Charles2530 Sep 14, 2025
9ff0d68
fix: Add BFloat16 data type support and conversion
Charles2530 Sep 14, 2025
83e820a
feat: Refactor tensor collection and visualization system; add micro_batch control
Charles2530 Sep 15, 2025
7321255
fix: Remove all sample_idx parameters to fix TypeError
Charles2530 Sep 15, 2025
0321245
update script
Charles2530 Sep 15, 2025
f1a30d5
refactor: Remove collect_micro_batches parameter and simplify tensor collection logic
Charles2530 Sep 15, 2025
16a5a1e
fix: Fix incorrect increment_iteration_data method call in tensor_saver.py
Charles2530 Sep 15, 2025
39922df
fix: Fix run not ending after tensor collection
Charles2530 Sep 15, 2025
a267409
fix: Fix program not being forced to exit after tensor collection completes
Charles2530 Sep 15, 2025
759f5ab
fix: Fix serialization error when saving tensors
Charles2530 Sep 15, 2025
af27256
fix: Fix missing layer index in the backward phase of Linear layers
Charles2530 Sep 15, 2025
8a0db2e
fix: Fix duplicate tensor collection in warmup and steady-state phases
Charles2530 Sep 15, 2025
42d0170
debug: Add detailed debug logs to help diagnose tensor collection issues
Charles2530 Sep 15, 2025
4331b15
fix: Fix should_exit_after_forward() logic and tensor collection flow
Charles2530 Sep 15, 2025
14e733f
pdb debug
Charles2530 Sep 15, 2025
f735b3d
fix: Fix tensor collection logic in the no-pipelining mode used by llama32
Charles2530 Sep 15, 2025
da4e091
fix: Fix tensor save path in run_tensor_collection.sh
Charles2530 Sep 15, 2025
134ef60
debug: Add detailed path debugging information
Charles2530 Sep 15, 2025
9e61a97
cleanup: Delete all test files and debug code
Charles2530 Sep 15, 2025
7716a3e
simplify: Simplify tensor storage log output
Charles2530 Sep 15, 2025
a462586
update: Update three scripts to match the current path structure
Charles2530 Sep 15, 2025
ad7d4d2
update: Adapt visualization logic to single-micro_batch collection mode
Charles2530 Sep 15, 2025
8823c09
fix: Fix None values in visualization code and filter out invalid tensors
Charles2530 Sep 15, 2025
dc0e91c
fix: Fix KeyError; support dynamic sample and layer data
Charles2530 Sep 15, 2025
44ece7d
refactor: Remove the sample concept and use rank directly
Charles2530 Sep 15, 2025
a2167e2
fix: Update shell scripts to use the rank parameter
Charles2530 Sep 15, 2025
cb5adeb
feat: Add efficient mode that loads only tensor files for a specific layer and rank
Charles2530 Sep 15, 2025
18520a1
fix: Change default analysis_type to layer to ensure efficient mode is used
Charles2530 Sep 15, 2025
3b98124
fix: Remove call to nonexistent _create_output_directories method
Charles2530 Sep 15, 2025
db35d15
fix: Fix BFloat16 type conversion error
Charles2530 Sep 15, 2025
b8ffb55
Add comprehensive visualization script run_draw_all.sh
Charles2530 Sep 16, 2025
1427371
Fix 'sample' to 'rank' in tensor_visualizer.py
Charles2530 Sep 16, 2025
be51395
Clean up temporary test and debug files
Charles2530 Sep 17, 2025
0393801
feat: Add comprehensive tensor visualization and analysis tools
Charles2530 Sep 17, 2025
2e71ebc
fix: Add BFloat16 tensor support and empty tensor handling
Charles2530 Sep 17, 2025
f2f70c9
feat: Add default output directory for overflow analysis
Charles2530 Sep 17, 2025
941b4c2
reformat
Charles2530 Sep 17, 2025
d949a25
feat: Display analysis results before saving to log
Charles2530 Sep 17, 2025
352ade8
feat: Add tensor distribution visualization tool
Charles2530 Sep 17, 2025
1b1ad00
feat: Add tqdm progress bar to overflow_summary.py
Charles2530 Sep 17, 2025
6d3e4f3
feat: Add multi-file input support to overflow.py
Charles2530 Sep 17, 2025
a70036c
feat: Add multi-parameter support to layer_analysis.py
Charles2530 Sep 17, 2025
4cd1caf
feat: Add dynamic range adjustment and intelligent boundary display
Charles2530 Sep 17, 2025
85e6261
Add MXFP scaling analysis tools
Charles2530 Sep 20, 2025
4065890
Clean up: Remove outdated documentation files
Charles2530 Sep 20, 2025
13bc2c0
Add comprehensive logging functionality to MXFP scaling test
Charles2530 Sep 20, 2025
8f23aff
Fix scaling alignment logic for MXFP quantization
Charles2530 Sep 20, 2025
36646f2
Add intelligent scaling factor analysis and recommendations
Charles2530 Sep 20, 2025
7d4dc72
Clean up mxfp.py: Remove duplicate and commented code
Charles2530 Sep 20, 2025
251c367
Refactor underflow analysis for better organization and clarity
Charles2530 Sep 20, 2025
5951c69
Add multi-tensor support for MXFP scaling test tool
Charles2530 Sep 21, 2025
3bea64c
Complete overflow/underflow analysis implementation
Charles2530 Sep 21, 2025
0894223
Fix mxfp_scaling_test.py to correctly simulate mxfp.py behavior
Charles2530 Sep 22, 2025
c2d7ed1
Modify scaling range logic to use user-specified parameters directly
Charles2530 Sep 22, 2025
28c32b9
Remove num_levels parameter and use only integer scale exponents
Charles2530 Sep 22, 2025
761c2a6
Fix tie-breaking logic to consistently choose larger scale exponent
Charles2530 Sep 22, 2025
949c6fc
Integrate tensor saving functionality into quantization operators
Charles2530 Sep 22, 2025
c681433
Fix duplicate tensor saving and integrate new operators in layers.py
Charles2530 Sep 22, 2025
be79c95
Fix tensor saving integration by updating function calls
Charles2530 Sep 22, 2025
b33b3d5
Fix torch.autograd.Function.apply() keyword arguments issue
Charles2530 Sep 23, 2025
8dc28d9
Add bf16_matmul wrapper function for consistent tensor saving interface
Charles2530 Sep 23, 2025
763b842
Complete tensor saving integration for attention module
Charles2530 Sep 23, 2025
b07f2b9
Fix syntax error in HIFPBAddBmm forward method
Charles2530 Sep 23, 2025
b283f98
Fix AttributeError: ctx.metadata is not writable
Charles2530 Sep 23, 2025
a567727
Fix gradient count mismatch in BF16 operators
Charles2530 Sep 23, 2025
b573709
Fix gradient count for MXFPBAddBmm backward
Charles2530 Sep 23, 2025
c3dcdd1
Fix BF16BAddBmm backward gradient count
Charles2530 Sep 23, 2025
f12e3f0
Remove Pipeline DEBUG output messages
Charles2530 Sep 23, 2025
65185a3
Fix IndentationError in schedules.py
Charles2530 Sep 23, 2025
0fe34f3
Fix training configuration: use iteration-based training instead of s…
Charles2530 Sep 23, 2025
702a54a
Fix checkpoint loading: make --load conditional on checkpoint existence
Charles2530 Sep 23, 2025
49e0912
Fix optimizer parameter scheduler mismatch with checkpoint
Charles2530 Sep 23, 2025
f843e39
Fix parameter name: use underscore instead of hyphen
Charles2530 Sep 23, 2025
1f115cd
Remove unused bf16_linear function
Charles2530 Sep 23, 2025
cd80c66
Remove unused BF16Linear class
Charles2530 Sep 23, 2025
219cac3
Fix syntax error in tensor_saver.py
Charles2530 Sep 23, 2025
36ac671
Fix backward tensor collection issue
Charles2530 Sep 23, 2025
14a2f4e
Improve rank filtering logic in tensor_saver
Charles2530 Sep 23, 2025
a9da993
Fix 'break' outside loop syntax error
Charles2530 Sep 23, 2025
0edd9e4
Fix missing layer_idx in backward tensor saving
Charles2530 Sep 23, 2025
3c16433
Fix missing layer_idx for all linear backward tensors
Charles2530 Sep 23, 2025
87da0aa
Remove unused tensor_saver imports from dot_product_attention.py
Charles2530 Sep 23, 2025
6314b2d
Fix attention tensor naming in quantization operators
Charles2530 Sep 23, 2025
d6f52cc
Remove debug logs from TensorCollectionState
Charles2530 Sep 23, 2025
6a011cd
Improve MXFP scaling test output organization and add visualization t…
Charles2530 Sep 23, 2025
cd8bc51
Fix variable name in mxfp_scaling_test.py
Charles2530 Sep 23, 2025
adf6724
Fix MXFPBAddBmm backward gradient count mismatch
Charles2530 Sep 24, 2025
41eb51d
Fix MXFPBAddBmm backward gradient count - correct to 14
Charles2530 Sep 24, 2025
fc3f549
update backward in mxfp
Charles2530 Sep 24, 2025
5b343ac
Fix control_iter default value causing premature training exit
Charles2530 Sep 24, 2025
d88bb9b
Clean up Chinese outputs and debug logs for production use
Charles2530 Sep 24, 2025
6aff83b
Fix IndentationError in tensor_saver.py
Charles2530 Sep 24, 2025
eeaeb74
Add scaling_control parameter for MX quantization strategies
Charles2530 Sep 24, 2025
1f66cd7
Update default scaling_control to max_minus_1 for better numerical st…
Charles2530 Sep 24, 2025
ccd5d25
update backward in mxfp
Charles2530 Sep 24, 2025
b622a39
Implement time-resume adaptive quantization training system
Charles2530 Sep 24, 2025
a26ef35
Fix checkpoint saving/loading in adaptive quantization
Charles2530 Sep 24, 2025
Binary file added .config/clash/cache.db
Binary file not shown.
1 change: 1 addition & 0 deletions .config/clash/config.yaml
@@ -0,0 +1 @@
mixed-port: 7890
50 changes: 34 additions & 16 deletions .gitignore
@@ -1,18 +1,36 @@
__pycache__
# Simulated data and temporary files
enhanced_tensor_logs/
draw/
*.pt
*.png
*.jpg
*.jpeg

# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so
build
.coverage_*
*.egg-info
*~
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err

# Environment files
.env
.venv
runs/
/test_cases/
**/dist/
env/
venv/

# IDE files
.vscode/
.idea/
*.swp
*.swo

# Log files
*.log

# Temporary files
*.tmp
*.temp

# System files
.DS_Store
Thumbs.db
Binary file added .pretrain_gpt.py.swo
Binary file not shown.
204 changes: 204 additions & 0 deletions LAYER_ANALYSIS_README.md
@@ -0,0 +1,204 @@
# Layer Distribution Analysis Tool

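## Overview

A tool for analyzing the tensor distributions of a single layer. For attention layers it covers q, k, v, and output; for linear layers it covers input, weight, and output. It matches tensor files with regular expressions and produces detailed distribution plots and a statistics report.

## Features

### 1. Layer analysis
- **Attention layer analysis**: analyzes the distributions of query, key, value, output, and attention_weights
- **Linear layer analysis**: analyzes the distributions of input, weight, output, bias, and hidden
- **Multi-subplot display**: one figure with 6 subplots showing the distributions of the different tensor types
- **Statistics**: each subplot shows key statistics such as mean and standard deviation

### 2. Quantization comparison
- **Multi-format comparison**: shows the bf16, mxfp8, mxfp4, and hifp8 distributions side by side
- **Per-tensor analysis**: the comparison can be restricted to a specific tensor type
- **2x2 subplot layout**: clearly shows how the distributions differ across the 4 quantization types

### 3. Statistics report
- **Detailed statistics**: mean, standard deviation, quantiles, and other summary statistics
- **File count**: shows the number of files found
- **Data quality**: shows the number of valid data points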

## Usage

### Basic usage

#### Calling the Python script directly
```bash
# Analyze an attention layer
python analyze_layer_distribution.py --layer 1 --sample 0 --layer_type attention

# Analyze a linear layer
python analyze_layer_distribution.py --layer 2 --sample 1 --layer_type linear

# Quantization comparison
python analyze_layer_distribution.py --layer 1 --sample 0 --layer_type attention --tensor_type query --quantization_comparison
```

#### Calling via the shell script
```bash
# Basic usage
./run_layer_analysis.sh <tensor_dir> <output_dir> <layer> <sample> <layer_type>

# Examples
./run_layer_analysis.sh ./enhanced_tensor_logs ./layer_output 1 0 attention
./run_layer_analysis.sh ./enhanced_tensor_logs ./layer_output 2 1 linear

# With quantization comparison
./run_layer_analysis.sh ./enhanced_tensor_logs ./layer_output 1 0 attention query true
```

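### Parameters

#### Python script arguments
- `--tensor_dir`: tensor file directory (default: ./enhanced_tensor_logs)
- `--output_dir`: output directory (default: ./layer_analysis_output)
- `--layer`: layer number (required, e.g. 1, 2, 3, ...)
- `--sample`: sample number (required, e.g. 0, 1, 2)
- `--layer_type`: layer type (required, attention or linear)
- `--tensor_type`: specific tensor type (optional, used for quantization comparison)
- `--quantization_comparison`: enable quantization comparison (optional)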
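
The argument list above maps naturally onto `argparse`. The following is only a minimal sketch of how such a command line could be declared; the function name `build_parser` and the exact types and choices are assumptions for illustration, not the actual contents of `analyze_layer_distribution.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(
        description="Layer distribution analysis (illustrative sketch)"
    )
    parser.add_argument("--tensor_dir", default="./enhanced_tensor_logs",
                        help="tensor file directory")
    parser.add_argument("--output_dir", default="./layer_analysis_output",
                        help="output directory")
    parser.add_argument("--layer", type=int, required=True, help="layer number")
    parser.add_argument("--sample", type=int, required=True, help="sample number")
    parser.add_argument("--layer_type", choices=["attention", "linear"],
                        required=True, help="layer type")
    parser.add_argument("--tensor_type", default=None,
                        help="specific tensor type, used for quantization comparison")
    parser.add_argument("--quantization_comparison", action="store_true",
                        help="enable quantization comparison")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--layer", "1", "--sample", "0", "--layer_type", "attention"]
)
print(args.layer, args.sample, args.layer_type, args.tensor_dir)
```

Declaring `--quantization_comparison` with `action="store_true"` matches the documented usage, where the flag is passed without a value.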

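#### Shell script arguments
1. `tensor_dir`: tensor file directory
2. `output_dir`: output directory
3. `layer`: layer number
4. `sample`: sample number
5. `layer_type`: layer type (attention/linear)
6. `tensor_type`: specific tensor type (optional)
7. `quantization_comparison`: whether to enable quantization comparison (true/false)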

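## Output files

### 1. Layer analysis plot
- **File name format**: `layer_{layer}_sample_{sample}_{layer_type}_analysis.png`
- **Content**: 6 subplots showing the distributions of the different tensor types
- **Statistics**: each subplot includes mean, standard deviation, and other statistics

### 2. Quantization comparison plot
- **File name format**: `quantization_comparison_layer_{layer}_sample_{sample}_{layer_type}_{tensor_type}.png`
- **Content**: 2x2 subplots comparing the distributions of the 4 quantization types
- **When to use**: when you need to compare how different quantization types affect the same tensor

### 3. Statistics report
- **File name format**: `statistics_layer_{layer}_sample_{sample}_{layer_type}.txt`
- **Content**: detailed numerical statistics
- **Includes**: file count, number of data points, mean, standard deviation, quantiles, and more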
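
As a rough illustration of what such a report contains, the sketch below computes the listed statistics with the standard library and builds the report file name from the documented pattern. The helper names `summarize` and `report_filename` are hypothetical, and the choice of population standard deviation and quartiles is an assumption; the tool itself may compute these differently.

```python
import statistics


def summarize(values):
    """Summary statistics for a flat list of values (needs at least 2 values)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q25, median, q75 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std; an assumption
        "min": min(values),
        "q25": q25,
        "median": median,
        "q75": q75,
        "max": max(values),
    }


def report_filename(layer, sample, layer_type):
    """File name following the documented statistics report pattern."""
    return f"statistics_layer_{layer}_sample_{sample}_{layer_type}.txt"


stats = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
print(report_filename(1, 0, "attention"))
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```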

## Supported tensor types

### Attention layers
- `query`: query tensor
- `key`: key tensor
- `value`: value tensor
- `output`: output tensor
- `attention_weights`: attention weight matrix

### Linear layers
- `input`: input tensor
- `weight`: weight tensor
- `output`: output tensor
- `bias`: bias tensor
- `hidden`: hidden-layer tensor

## Supported file naming format

The tool supports the following file naming format:
```
YYYYMMDD_HHMMSS_XXXX_iterXXX_layer_type_LX_operation_phase_component_quant_type_rankXX_sampleXXX_groupXXX_tensor_name.pt
```

Example:
```
20250914_075006_1399_iter000_attention_L1_forward_post_FA_bf16_rank07_sample000_group000_attention_weights.pt
```
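
To make the naming scheme concrete, here is a minimal sketch of a regular expression that splits such a file name into its fields. The pattern is inferred from the single example above: the split into `operation`, `phase`, and `component`, and the closed list of quantization types, are assumptions, and the expression actually used by the tool may differ.

```python
import re

# Assumed field layout, inferred from the documented example file name.
TENSOR_FILE_RE = re.compile(
    r"^(?P<date>\d{8})_(?P<time>\d{6})_(?P<counter>\d+)"
    r"_iter(?P<iteration>\d+)"
    r"_(?P<layer_type>attention|linear)_L(?P<layer>\d+)"
    r"_(?P<operation>[A-Za-z]+)_(?P<phase>[A-Za-z]+)_(?P<component>[A-Za-z0-9]+)"
    r"_(?P<quant_type>bf16|mxfp8|mxfp4|hifp8)"
    r"_rank(?P<rank>\d+)_sample(?P<sample>\d+)_group(?P<group>\d+)"
    r"_(?P<tensor_name>.+)\.pt$"
)

INT_FIELDS = ("iteration", "layer", "rank", "sample", "group")


def parse_tensor_filename(name):
    """Return the fields of a tensor file name as a dict, or None if it does not match."""
    match = TENSOR_FILE_RE.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    for key in INT_FIELDS:
        fields[key] = int(fields[key])  # "rank07" -> 7, "sample000" -> 0, ...
    return fields


example = (
    "20250914_075006_1399_iter000_attention_L1_forward_post_FA_bf16"
    "_rank07_sample000_group000_attention_weights.pt"
)
info = parse_tensor_filename(example)
print(info["layer_type"], info["layer"], info["quant_type"], info["rank"], info["tensor_name"])
# prints: attention 1 bf16 7 attention_weights
```

Filtering by layer, sample, and layer type then reduces to comparing the parsed fields, which is easier to maintain than building one pattern per query.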

## Examples

### Example 1: Analyze the attention distributions of layer 1, sample 0
```bash
python analyze_layer_distribution.py --layer 1 --sample 0 --layer_type attention
```

### Example 2: Analyze the linear distributions of layer 2, sample 1
```bash
python analyze_layer_distribution.py --layer 2 --sample 1 --layer_type linear
```

### Example 3: Compare quantization effects on the query tensor of layer 1, sample 0
```bash
python analyze_layer_distribution.py --layer 1 --sample 0 --layer_type attention --tensor_type query --quantization_comparison
```

### Example 4: Analyze with the shell script
```bash
# Analyze an attention layer
./run_layer_analysis.sh ./enhanced_tensor_logs ./output 1 0 attention

# Analyze a linear layer with quantization comparison enabled
./run_layer_analysis.sh ./enhanced_tensor_logs ./output 2 1 linear weight true
```

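## Technical features

### Data processing
- **Automatic data cleaning**: NaN and Inf values are removed automatically
- **Data sampling**: large datasets are sampled automatically to improve performance
- **Multi-file merging**: multiple tensor files of the same type are merged automatically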
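
The three processing steps can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the `max_points` threshold, and the use of uniform random sampling are assumptions, and the tool presumably works on tensors rather than Python lists.

```python
import math
import random


def clean_and_sample(values, max_points=100_000, seed=0):
    """Drop NaN/Inf values, then subsample if too many points remain."""
    finite = [v for v in values if math.isfinite(v)]
    if len(finite) > max_points:
        # Uniform sampling without replacement; seeded for reproducible plots.
        finite = random.Random(seed).sample(finite, max_points)
    return finite


def merge_and_clean(chunks, max_points=100_000, seed=0):
    """Merge the values of several files of the same tensor type, then clean."""
    merged = [v for chunk in chunks for v in chunk]
    return clean_and_sample(merged, max_points=max_points, seed=seed)


file_a = [1.0, float("nan"), 2.0]
file_b = [float("inf"), 3.0, 4.0]
result = merge_and_clean([file_a, file_b], max_points=3)
print(len(result), all(math.isfinite(v) for v in result))
# prints: 3 True
```

Cleaning before sampling keeps the sample size predictable, since non-finite values never occupy slots in the sampled set.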

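### Visualization quality
- **High resolution**: 300 DPI output
- **Professional color scheme**: standard scientific-visualization palette
- **Clear annotation**: detailed plot labels and statistics

### Error handling
- **Corrupted files**: corrupted tensor files are skipped automatically
- **Format compatibility**: multiple tensor file formats are supported
- **Graceful degradation**: a friendly message is shown when data is missing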
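
The skip-on-failure behaviour can be sketched as a loop that records failures instead of aborting. In this hedged sketch the loader is passed in as a parameter so the example runs without PyTorch; in the real tool the loader would presumably be something like `torch.load`, and `load_valid` is a hypothetical name.

```python
def load_valid(paths, loader):
    """Load every path with `loader`, skipping files that fail to load.

    Returns (loaded, skipped), where skipped holds (path, error message) pairs.
    """
    loaded, skipped = [], []
    for path in paths:
        try:
            loaded.append(loader(path))
        except Exception as exc:  # a corrupted file must not abort the whole run
            skipped.append((path, str(exc)))
    return loaded, skipped


def fake_loader(path):
    """Stand-in loader for the demo: fails for names containing 'corrupt'."""
    if "corrupt" in path:
        raise ValueError(f"cannot read {path}")
    return [0.0, 1.0]


loaded, skipped = load_valid(["a.pt", "corrupt.pt", "b.pt"], fake_loader)
if not loaded:
    print("No valid data")  # graceful degradation when nothing could be loaded
print(len(loaded), [path for path, _ in skipped])
# prints: 2 ['corrupt.pt']
```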

## Requirements

### Python packages
- torch
- matplotlib
- numpy
- pandas
- seaborn

### Installing dependencies
```bash
pip install torch matplotlib numpy pandas seaborn scipy
```

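## Notes

1. **File format**: make sure the tensor files are correctly formatted and readable
2. **Memory usage**: watch memory usage when processing a large number of tensor files
3. **Output directory**: make sure you have write permission for the output directory
4. **Layer and sample**: make sure tensor files exist for the specified layer and sample

## Troubleshooting

### Common problems
1. **No data found**: check that the layer number, sample number, and layer type are correct
2. **No valid data**: check whether the tensor files are corrupted or incorrectly formatted
3. **Import error**: install the missing Python packages
4. **Permission denied**: check write permission for the output directory

### Debugging tips
1. Check that the tensor files exist
2. Verify that the file names follow the expected format
3. Confirm the Python environment configuration
4. Read the detailed error message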

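## Version history

### v1.0.0 (current version)
- Basic layer analysis
- Support for attention and linear layers
- Quantization comparison
- Statistics report generation
- Shell script wrapper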