LLaMA2 70B H100 performance issue #5180

ShinoharaHare · 2023-12-11T20:39:57Z

ShinoharaHare
Dec 11, 2023

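I tested LLaMA2 70B on 4 H100 nodes (32 GPUs in total), but only reached about 170 TFLOPS, which seems to be roughly what an A100 delivers. I'm not sure whether this is a reasonable number; if it isn't, I'd like to ask where the problem might lie.

The parallelism parameters are as follows: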

TP=8
PP=1
ZERO=2

num_samples: 3, dp_world_size: 4, flop_megatron: 6880279910154240.0, flop: 6780680423866368, avg_duration: 4.944727897644043, avg_throughput: 2.4268271679246647
Throughput: 2.43 samples/sec, TFLOPS per GPU by Megatron: 173.93, TFLOPS per GPU: 171.41
Max CUDA memory usage: 63.46 GB
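
The fields in the log above appear to be related by simple arithmetic. The sketch below is inferred from the printed numbers only, not from the benchmark script: it assumes `flop` is counted per model replica and that a replica spans 8 GPUs (TP=8, PP=1). The function names are hypothetical.

```python
def throughput(num_samples: int, dp_world_size: int, avg_duration: float) -> float:
    """Samples per second summed over all data-parallel replicas."""
    return num_samples * dp_world_size / avg_duration


def tflops_per_gpu(flop: float, avg_duration: float, gpus_per_replica: int) -> float:
    """Per-GPU TFLOPS, assuming `flop` is counted per model replica."""
    return flop / avg_duration / gpus_per_replica / 1e12


# Values copied from the 4-node (32 x H100) log; TP=8, PP=1 -> 8 GPUs per replica.
duration = 4.944727897644043
print(round(throughput(3, 4, duration), 2))
print(round(tflops_per_gpu(6880279910154240.0, duration, 8), 2))
print(round(tflops_per_gpu(6780680423866368, duration, 8), 2))
```

Under these assumptions the three printed values agree with the logged 2.43 samples/sec, 173.93 and 171.41 TFLOPS per GPU.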

Also, if I want to train LLaMA2 70B on nine 8-GPU H100 nodes, what parallelism configuration would you recommend?
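
As a starting point for the 9-node question, the sketch below only enumerates which (TP, PP, DP) layouts divide 72 GPUs evenly; it does not rank them. Restricting TP to divisors of the per-node GPU count and capping PP at `max_pp` are assumptions made for illustration, not ColossalAI rules, and `candidate_layouts` is a hypothetical helper.

```python
def candidate_layouts(num_nodes: int, gpus_per_node: int, max_pp: int = 4):
    """List (tp, pp, dp) triples whose product equals the total GPU count.

    TP is restricted to divisors of gpus_per_node so that tensor-parallel
    traffic stays inside one node (an assumption, not a framework rule).
    """
    world_size = num_nodes * gpus_per_node
    layouts = []
    for tp in range(1, gpus_per_node + 1):
        if gpus_per_node % tp:
            continue  # tp must divide the GPUs in a node
        for pp in range(1, max_pp + 1):
            if world_size % (tp * pp) == 0:
                layouts.append((tp, pp, world_size // (tp * pp)))
    return layouts


for tp, pp, dp in candidate_layouts(num_nodes=9, gpus_per_node=8):
    print(f"TP={tp} PP={pp} DP={dp}")
```

Because 72 is not a power of two, some combinations drop out (for example TP=8 with PP=2), which is worth keeping in mind when porting a layout tuned on 4 or 6 nodes.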

kurisusnowdeng · 2023-12-12T02:54:40Z

kurisusnowdeng
Dec 12, 2023
Maintainer

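There still seems to be GPU memory headroom. Have you tried increasing the batch size, and what was the result? Also, have you confirmed that flash attention is actually being used?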
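
A first, partial check for the flash attention question is whether the `flash_attn` package is importable in the training environment. This minimal sketch checks installation only; it cannot tell whether the model's attention layers actually call it. The helper name is hypothetical.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """True if the `flash_attn` package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


print("flash_attn importable:", flash_attn_installed())
```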

7 replies

ShinoharaHare Dec 12, 2023
Author

@kurisusnowdeng
I also tested the 7B model on a single 8-GPU node. TFLOPS was much higher, but strangely the usable batch size became smaller once TP was enabled.

TP=1, BS=16

num_samples: 48, dp_world_size: 8, flop_megatron: 1.1976567844503552e+16, flop: 10598611339444224, avg_duration: 26.73107147216797, avg_throughput: 14.365305199224853
Throughput: 14.37 samples/sec, TFLOPS per GPU by Megatron: 448.04, TFLOPS per GPU: 396.49
Max CUDA memory usage: 70.93 GB

TP=2, BS=12

num_samples: 36, dp_world_size: 4, flop_megatron: 8982425883377664.0, flop: 7948958504583168, avg_duration: 11.649053573608398, avg_throughput: 12.361519250476102
Throughput: 12.36 samples/sec, TFLOPS per GPU by Megatron: 385.54, TFLOPS per GPU: 341.18
Max CUDA memory usage: 56.90 GB

TP=2, BS=16

OOM

TP=4, BS=12

num_samples: 36, dp_world_size: 2, flop_megatron: 8982425883377664.0, flop: 7948958504583168, avg_duration: 7.082586765289307, avg_throughput: 10.165777333339708
Throughput: 10.17 samples/sec, TFLOPS per GPU by Megatron: 317.06, TFLOPS per GPU: 280.58
Max CUDA memory usage: 55.95 GB

TP=4, BS=16

OOM
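
To make the batch-size observation concrete, the sketch below divides each batch size by the number of GPUs that jointly process it. It assumes BS is the batch per data-parallel replica (which the logged `num_samples` values suggest) and PP=1 (which the logged `dp_world_size` values imply on 8 GPUs).

```python
# (TP size, batch size per replica, logged max CUDA memory in GB) from the runs above.
runs = [(1, 16, 70.93), (2, 12, 56.90), (4, 12, 55.95)]


def samples_per_gpu(tp: int, batch_size: int) -> float:
    """Batch size divided by the GPUs that jointly process it (PP=1 assumed)."""
    return batch_size / tp


for tp, bs, mem_gb in runs:
    print(f"TP={tp}: {samples_per_gpu(tp, bs):g} samples/GPU at {mem_gb} GB")
```

Seen this way, the largest batch that fits drops from 16 samples per GPU at TP=1 to 6 at TP=2 and 3 at TP=4, even though TP splits the weights across more GPUs.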

flybird11111 Dec 13, 2023
Collaborator

Which plugin did you use for each of the 7B model tests?

ShinoharaHare Dec 13, 2023
Author

All of these runs used HybridParallelPlugin.

ShinoharaHare Dec 14, 2023
Author

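@flybird11111 Hello,
I noticed that your latest commit seems to update the DistCrossEntropy functionality, so I want to point out that all of my earlier tests were run on the source code from before that DistCrossEntropy update.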

flybird11111 Dec 14, 2023
Collaborator

OK, got it.

flybird11111 · 2023-12-13T02:36:14Z

flybird11111
Dec 13, 2023
Collaborator

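ShinoharaHare wrote: "@kurisusnowdeng I also tested the 7B model on a single 8-GPU node. TFLOPS was much higher, but strangely the usable batch size became smaller once TP was enabled."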

Thank you. I will look into this and resolve it as soon as possible.

18 replies

ShinoharaHare Dec 15, 2023
Author

Currently, with 6 nodes, ZERO=2, TP=4, BS=2, I can reach about 270 TFLOPS. Do you have any optimization suggestions?

num_samples: 6, dp_world_size: 12, flop_megatron: 1.376055982030848e+16, flop: 13561360847732736, avg_duration: 12.496347427368164, avg_throughput: 5.76168359742524
Throughput: 5.76 samples/sec, TFLOPS per GPU by Megatron: 275.29, TFLOPS per GPU: 271.31
Max CUDA memory usage: 60.15 GB

flybird11111 Dec 15, 2023
Collaborator

Is this currently the best setting? Could you try increasing TP? The actual communication bandwidth we measured is around 68 GBps.

ShinoharaHare Dec 15, 2023
Author

Yes. TP=8, BS=4 is slightly slower, at around 264 TFLOPS.

ShinoharaHare Dec 15, 2023
Author

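@flybird11111
Should the B in GBps be a lowercase b, meaning bits?
I used iperf3 to measure the speed between nodes: some pairs reach 34 Gbit/s, while others only reach 20 Gbit/s. So it looks like this is where the bottleneck is?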

flybird11111 Dec 19, 2023
Collaborator

GBps is with a capital B (bytes); it was not a typo.
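
For reference, the two units in this exchange differ by a factor of 8. The small sketch below converts the iperf3 readings to gigabytes per second and the 68 GBps figure to gigabits per second; the helper name is hypothetical.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


for link in (34, 20):
    print(f"{link} Gbit/s = {gbit_to_gbyte(link):.2f} GB/s")

# The maintainers' figure, expressed in the unit iperf3 reports.
print(f"68 GB/s = {68 * 8} Gbit/s")
```

Note that iperf3 measures TCP throughput over the IP network, which may not be the same path the collective-communication library uses, so the two measurements are not necessarily directly comparable.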

ShinoharaHare · 2023-12-14T04:08:26Z

ShinoharaHare
Dec 14, 2023
Author

Hello, while training LLaMA2 70B I ran into OOM when saving and loading checkpoints, so I'd like to ask about that separately.

Configuration:
ColossalAI: 79718fa
Plugin: HybridParallelPlugin

zero_stage=2
tp_size=4
pp_size=1
precision=bf16
Hardware: 6 (Nodes) x 8 H100
Batch Size: 2
Max Length: 4096
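
For a rough sense of how much memory this configuration needs before any activations, here is a back-of-envelope sketch under textbook ZeRO-2 accounting: bf16 weights and gradients at 2 bytes each, plus 12 bytes per parameter for fp32 master weights and the two Adam moments. The 70e9 parameter count is rounded, and the estimate ignores activations, buffers, fragmentation, and how ColossalAI actually lays out memory, so treat it as an assumption-laden illustration only.

```python
GIB = 1024 ** 3


def zero2_static_memory_gib(num_params: float, tp: int, dp: int) -> dict:
    """Per-GPU weight/gradient/optimizer memory under textbook ZeRO-2 accounting.

    Assumes bf16 weights and gradients (2 bytes each) and Adam with fp32
    master weights, momentum and variance (12 bytes per parameter).
    """
    per_tp_rank = num_params / tp   # parameters held by one TP rank
    weights = 2 * per_tp_rank       # replicated across DP ranks
    grads = 2 * per_tp_rank / dp    # sharded across DP ranks by ZeRO-2
    optim = 12 * per_tp_rank / dp   # sharded across DP ranks
    return {"weights": weights / GIB, "grads": grads / GIB, "optimizer": optim / GIB}


# 6 nodes x 8 GPUs with tp_size=4, pp_size=1 gives dp = 48 / 4 = 12.
est = zero2_static_memory_gib(70e9, tp=4, dp=12)
for name, gib in est.items():
    print(f"{name}: {gib:.1f} GiB")
print(f"total: {sum(est.values()):.1f} GiB")
```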

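Symptoms:

OOM after saving: training runs normally and the first few checkpoint saves also succeed, but after roughly the 5th successful save, training hits CUDA OOM. Could the saving process be leaking GPU memory?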

self.booster.save_model(
    self.boosted_model,
    os.path.join(checkpoint_path, _CKPT_MODEL_DIR),
    shard=True,
    size_per_shard=1024,
    use_safetensors=True
)
self.booster.save_optimizer(
    self.optimizers[0],
    os.path.join(checkpoint_path, _CKPT_OPTIMIZER_DIR),
    shard=True,
    size_per_shard=1024
)

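OOM when loading the optimizer state: it hits CUDA OOM right away. Judging from the source excerpt below, is this because the complete optimizer state is loaded first and only sharded afterwards? In principle, shouldn't it shard while loading in order to save GPU memory?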

ColossalAI/colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py

Lines 572 to 606 in 79718fa

    
# Load saved states to optimizer.
# Keep a record of loaded files so that file will not be repeatedly loaded.
loaded_file = set()
for pg in optimizer.optim.param_groups:
    for param in pg["params"]:
        if param is None:
            continue
        param_id = _get_param_id_from_optimizer_param(param, master_to_working_map)
        if param_id not in weight_map:
            continue
        filename = weight_map[param_id]

        # If this param's states has been loaded before, directly return.
        if filename in loaded_file:
            continue
        file_path = os.path.join(ckpt_root_path, filename)
        state_dict = load_shard_state_dict(Path(file_path), use_safetensors=False)
        load_states_into_optimizer(optimizer.optim, state_dict, id_map, strict=True)
        loaded_file.add(filename)

# Then shard the loaded optimizer states if using tp/zero.
for param, state in optimizer.optim.state.items():
    device = param.device
    if master_to_working_map is not None:
        working_param = master_to_working_map[id(param)]
    else:
        working_param = param
    original_shape = optimizer.param_info["param2shape"][id(working_param)]
    sharded_state = self.shard_from_complete_optimizer_state(
        state, current_shape=working_param.shape, original_shape=original_shape, device=device, inplace=True
    )
    optimizer.optim.state[param] = sharded_state

sharded_optimizer_loading_epilogue(optimizer.optim)

self.booster.load_model(self.boosted_model, os.path.join(checkpoint_path, _CKPT_MODEL_DIR))
self.booster.load_optimizer(self.optimizers[0], os.path.join(checkpoint_path, _CKPT_OPTIMIZER_DIR))
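
The difference between "load everything, then shard" and "shard while loading" can be illustrated with a toy peak-memory model. This is a simulation with made-up sizes and hypothetical function names; it does not model where ColossalAI places the loaded tensors, and it is not the library's fix.

```python
def peak_load_all_then_shard(file_sizes):
    """Peak units held when every file is loaded before any state is sharded."""
    return sum(file_sizes)


def peak_shard_while_loading(file_sizes, shard_factor):
    """Peak units held when each file is sharded right after it is loaded."""
    kept = 0.0  # sharded states retained from earlier files
    peak = 0.0
    for size in file_sizes:
        peak = max(peak, kept + size)  # the full file is resident momentarily
        kept += size / shard_factor    # only this rank's slice is kept
    return peak


# Hypothetical numbers: 8 checkpoint files of 10 units each, sharded 4 ways.
files = [10] * 8
print(peak_load_all_then_shard(files))
print(peak_shard_while_loading(files, 4))
```

In this toy model the first strategy peaks at the full unsharded size, while the second peaks at one full file plus the slices kept so far, which is the saving the question above is asking about.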

6 replies

ShinoharaHare Dec 14, 2023
Author

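Probably not, because I'm sure the OOM happened immediately when the forward pass resumed right after the save finished, so it should be directly related to saving.
I'll double-check this later, though.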
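
One way to double-check would be to record allocated GPU memory right before and right after each save and see whether the difference grows. The sketch below is a generic helper with hypothetical names; the only PyTorch calls it relies on are `torch.cuda.synchronize` and `torch.cuda.memory_allocated`, and the import is lazy so the toy demonstration runs without a GPU.

```python
import gc
from typing import Callable


def allocated_delta(action: Callable[[], None], probe: Callable[[], int]) -> int:
    """Run `action` and return the change in `probe()` (bytes) it leaves behind."""
    gc.collect()
    before = probe()
    action()
    gc.collect()
    return probe() - before


def cuda_allocated_bytes() -> int:
    """Bytes currently allocated by tensors on the current CUDA device."""
    import torch  # imported lazily so allocated_delta works without PyTorch

    torch.cuda.synchronize()
    return torch.cuda.memory_allocated()


# Toy demonstration with a fake probe. In training, pass cuda_allocated_bytes
# and wrap the booster.save_model / booster.save_optimizer calls in `action`.
leaked = []
print(allocated_delta(lambda: leaked.append(bytearray(1024)), lambda: sum(map(len, leaked))))
```

A delta that stays positive across several saves would point toward memory being retained by the save path; a delta near zero would suggest looking elsewhere.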

ShinoharaHare Dec 15, 2023
Author

@flybird11111 Hello, I ran follow-up tests and confirmed that the OOM does not occur when checkpoints are not saved.

flybird11111 Dec 15, 2023
Collaborator

Thanks, I'll take a look.

ShinoharaHare Dec 19, 2023
Author

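@flybird11111
Is there any update on the saving and loading issues?
I don't know whether GeminiPlugin has the same problem. I later tried Gemini + offload and it seems to be faster than Hybrid + TP; if it doesn't have this problem, I may just use Gemini for now.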

flybird11111 Dec 19, 2023
Collaborator

Sure. Saving and loading work without problems with Gemini.

flybird11111 · 2023-12-21T05:01:35Z

flybird11111
Dec 21, 2023
Collaborator

zero_stage

Hi, it looks like saving checkpoints with HybridParallelPlugin does not affect GPU memory usage either.

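ShinoharaHare wrote: "@flybird11111 Is there any update on the saving and loading issues? I don't know whether GeminiPlugin has the same problem. I later tried Gemini + offload and it seems to be faster than Hybrid + TP; if it doesn't have this problem, I may just use Gemini for now."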

0 replies
