
Keep the weights of specific layers in float32 #302

Open

liaojianjin opened this issue Feb 21, 2022 · 5 comments
Comments

@liaojianjin

Is there a way to keep layers like BatchNorm2d in float32 during training? Training them in half precision can make the loss hard to converge.

@zhuzilin
Collaborator

You can wrap part of the model with the torch_scope interface; everything inside the torch scope is trained in fp32. For example, in the MoE example:

# The MoE modules are mainly model parallel, so we need to use `torch_scope`
# to separate them from the other chunk-based data parallel modules.
# Also, MoE modules take care of their own communication, which is why
# we need to disable allreduce in the torch scope.
with torch_scope(do_allreduce=False):
    self.output = fmoe.FMoETransformerMLP(
        num_expert=2,
        world_size=get_world_size(),
        d_model=config.hidden_size,
        d_hidden=config.intermediate_size,
        gate=fmoe.gates.NaiveGate,
    )

Note, however, that if you only want to keep a single layer in fp32, do_allreduce here should be set to True.
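For the original BatchNorm2d question, the usage would look roughly like the sketch below. This is only an illustration: the surrounding model code and the import path of torch_scope are assumptions, not something taken from this thread.

import torch.nn as nn
# NOTE: assumed import path; check where torch_scope is exported in your
# PatrickStar version.
from patrickstar.core import torch_scope

class ConvBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Only the BatchNorm2d parameters are kept out of chunk management,
        # so they stay torch-managed in fp32. Since this is a single
        # ordinary layer (not model parallel like MoE), do_allreduce stays
        # True so its gradients are still all-reduced across ranks.
        with torch_scope(do_allreduce=True):
            self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return self.bn(self.conv(x))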

@Jack47

Jack47 commented Mar 3, 2022


Nice. So this means this part is managed by torch itself, and ps doesn't need to be involved?

@liaojianjin
Author


I think so. torch_scope makes a temporary modification to the config:

from contextlib import contextmanager

@contextmanager
def torch_scope(do_allreduce=True):
    r"""All parameters initialized in this scope will not be managed in chunks."""
    _runtime_config.push()
    _runtime_config.config["use_chunk"] = False
    _runtime_config.config["do_allreduce"] = do_allreduce
    yield
    _runtime_config.pop()
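
The push/pop pair is what makes the change temporary. A minimal self-contained sketch of that pattern (an illustration, not PatrickStar's actual _runtime_config):

import copy

class RuntimeConfig:
    """Illustrative config object with a snapshot stack, so changes made
    inside a scope are reverted when the scope exits."""

    def __init__(self):
        self.config = {"use_chunk": True, "do_allreduce": True}
        self._stack = []

    def push(self):
        # Save a snapshot of the current config.
        self._stack.append(copy.deepcopy(self.config))

    def pop(self):
        # Restore the snapshot saved by the matching push().
        self.config = self._stack.pop()

_runtime_config = RuntimeConfig()
_runtime_config.push()
_runtime_config.config["use_chunk"] = False
print(_runtime_config.config["use_chunk"])  # False inside the scope
_runtime_config.pop()
print(_runtime_config.config["use_chunk"])  # True again after the scope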

After the module's init, the parameters are registered as torch-managed, and the inputs/outputs are kept in float:
if not _runtime_config.use_chunk:
    for name, param in module.named_parameters(recurse=False):
        name = f"{module.__class__.__name__}.{name}_{self.param_idx}"
        register_param(param, ParamType.TORCH_BASED, torch.float, name)
        if _runtime_config.do_allreduce:
            self.client.torch_param_allreduce_list.append(param)
    # We need to cast the inputs to fp32 for the unmanaged modules.
    cast_forward(module, torch.float)
    return
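
For intuition, an input-casting wrapper in the spirit of cast_forward could look like the sketch below. This is an assumption about its behavior, not PatrickStar's actual implementation.

import torch

def cast_forward_sketch(module, dtype):
    """Illustrative: cast floating-point tensor arguments to `dtype` before
    calling the module's original forward, so an fp32 module can sit inside
    an otherwise fp16 model."""
    original_forward = module.forward

    def wrapped_forward(*args, **kwargs):
        def cast(x):
            if torch.is_tensor(x) and x.is_floating_point():
                return x.to(dtype)
            return x

        args = tuple(cast(a) for a in args)
        kwargs = {k: cast(v) for k, v in kwargs.items()}
        return original_forward(*args, **kwargs)

    module.forward = wrapped_forward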

@zhuzilin
Collaborator

zhuzilin commented Mar 3, 2022

@Jack47 @liaojianjin
We are currently doing a full refactor of PatrickStar, so these features may change later. For example, we will probably reuse PyTorch autocast directly instead of implementing our own version of mixed-precision training. In that case, the problem raised in this issue of keeping layers like layernorm in fp32 would be solved naturally, and there would be no need to re-align precision again after migrating. So the interface exposed right now may be rather rough. Sorry about that...
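
For context, plain PyTorch autocast already covers this use case: parameters stay in fp32, and a numerically sensitive layer can be forced to run in fp32 by locally disabling autocast. A standard PyTorch sketch, not PatrickStar code:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.BatchNorm1d(16)).cuda()
x = torch.randn(8, 16, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    h = model[0](x)  # the matmul runs in fp16 under autocast
    # Force the BatchNorm1d layer to run in fp32 by disabling autocast
    # locally; the input is cast back to fp32 explicitly.
    with torch.autocast(device_type="cuda", enabled=False):
        out = model[1](h.float())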

@Jack47

Jack47 commented Mar 3, 2022


Got it, 👍
