
Keep the weights of specific layers in float32 #302

Open

liaojianjin opened this issue Feb 21, 2022 · 5 comments
Comments

@liaojianjin

Is there a way to keep layers like BatchNorm2d in float32 during training? Training them in half precision can make the loss hard to converge.

@zhuzilin
Collaborator

You can wrap part of the model with the torch_scope interface; everything inside the torch scope is trained in fp32. For example, in the MoE example:

# The MoE modules are mainly model parallel, so we need to use `torch_scope`
# to separate them from the other chunk-based data parallel modules.
# Also, MoE modules take care of their own communication, which is why
# we need to disable allreduce in the torch scope.
with torch_scope(do_allreduce=False):
    self.output = fmoe.FMoETransformerMLP(
        num_expert=2,
        world_size=get_world_size(),
        d_model=config.hidden_size,
        d_hidden=config.intermediate_size,
        gate=fmoe.gates.NaiveGate,
    )

Note, however, that if you only want to keep a single layer in fp32, do_allreduce here should be set to True.
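For the original BatchNorm2d question, the usage would look roughly like the sketch below. This is only an illustration: the surrounding model code and the import path of torch_scope are assumptions, not something taken from this thread.

import torch.nn as nn
# NOTE: assumed import path; check where torch_scope is exported in your
# PatrickStar version.
from patrickstar.core import torch_scope

class ConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Only the BatchNorm2d parameters are kept out of chunk management,
        # so they stay torch-managed in fp32. Since this is a single
        # ordinary layer (not model parallel like MoE), do_allreduce stays
        # True so its gradients are still all-reduced across ranks.
        with torch_scope(do_allreduce=True):
            self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return self.bn(self.conv(x))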

@Jack47

Jack47 commented Mar 3, 2022


Nice. So this means this part is managed by torch itself, and ps doesn't need to be involved?

@liaojianjin
Author


I think so. torch_scope makes a temporary modification to the config:

from contextlib import contextmanager

@contextmanager
def torch_scope(do_allreduce=True):
    r"""All parameters initialized in this scope will not be managed in chunks."""
    _runtime_config.push()
    _runtime_config.config["use_chunk"] = False
    _runtime_config.config["do_allreduce"] = do_allreduce
    yield
    _runtime_config.pop()
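
The push/pop pair is what makes the change temporary. A minimal self-contained sketch of that pattern (an illustration, not PatrickStar's actual _runtime_config):

import copy

class RuntimeConfig:
    """Illustrative config object with a snapshot stack, so changes made
    inside a scope are reverted when the scope exits."""

    def __init__(self):
        self.config = {"use_chunk": True, "do_allreduce": True}
        self._stack = []

    def push(self):
        # Save a snapshot of the current config.
        self._stack.append(copy.deepcopy(self.config))

    def pop(self):
        # Restore the snapshot saved by the matching push().
        self.config = self._stack.pop()

_runtime_config = RuntimeConfig()
_runtime_config.push()
_runtime_config.config["use_chunk"] = False
print(_runtime_config.config["use_chunk"])  # False inside the scope
_runtime_config.pop()
print(_runtime_config.config["use_chunk"])  # True again after the scope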

After the module's init, the parameters are registered as torch-managed, and the inputs/outputs are kept in float:
if not _runtime_config.use_chunk:
    for name, param in module.named_parameters(recurse=False):
        name = f"{module.__class__.__name__}.{name}_{self.param_idx}"
        register_param(param, ParamType.TORCH_BASED, torch.float, name)
        if _runtime_config.do_allreduce:
            self.client.torch_param_allreduce_list.append(param)
    # We need to cast the inputs to fp32 for the unmanaged modules.
    cast_forward(module, torch.float)
    return
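
For intuition, an input-casting wrapper in the spirit of cast_forward could look like the sketch below. This is an assumption about its behavior, not PatrickStar's actual implementation.

import torch

def cast_forward_sketch(module, dtype):
    """Illustrative: cast floating-point tensor arguments to `dtype` before
    calling the module's original forward, so an fp32 module can sit inside
    an otherwise fp16 model."""
    original_forward = module.forward

    def wrapped_forward(*args, **kwargs):
        def cast(x):
            if torch.is_tensor(x) and x.is_floating_point():
                return x.to(dtype)
            return x

        args = tuple(cast(a) for a in args)
        kwargs = {k: cast(v) for k, v in kwargs.items()}
        return original_forward(*args, **kwargs)

    module.forward = wrapped_forward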

@zhuzilin
Collaborator

zhuzilin commented Mar 3, 2022

@Jack47 @liaojianjin
We are currently doing a full refactor of PatrickStar, so these features may change later. For example, we will probably reuse PyTorch autocast directly instead of implementing our own version of mixed-precision training. In that case, the problem raised in this issue of keeping layers like layernorm in fp32 would be solved naturally, and there would be no need to re-align precision again after migrating. So the interface exposed right now may be rather rough. Sorry about that...
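
For context, plain PyTorch autocast already covers this use case: parameters stay in fp32, and a numerically sensitive layer can be forced to run in fp32 by locally disabling autocast. A standard PyTorch sketch, not PatrickStar code:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.BatchNorm1d(16)).cuda()
x = torch.randn(8, 16, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    h = model[0](x)  # the matmul runs in fp16 under autocast
    # Force the BatchNorm1d layer to run in fp32 by disabling autocast
    # locally; the input is cast back to fp32 explicitly.
    with torch.autocast(device_type="cuda", enabled=False):
        out = model[1](h.float())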

@Jack47

Jack47 commented Mar 3, 2022


Got it, 👍
