fix: merging of model for multi-gpu (foundation-model-stack#158)
* only copy over if an adapter is found; fixes a problem with LoRA multi-GPU training

Signed-off-by: Anh-Uong <[email protected]>

* formatting and helpful comment

Signed-off-by: Anh-Uong <[email protected]>

---------

Signed-off-by: Anh-Uong <[email protected]>
anhuong authored May 15, 2024
1 parent 38c4f22 commit eba20f3
Showing 1 changed file with 10 additions and 6 deletions.
16 changes: 10 additions & 6 deletions build/launch_training.py
@@ -142,12 +142,16 @@ def main():
                 export_path,
             )

-            create_merged_model(
-                checkpoint_models=full_checkpoint_dir,
-                export_path=export_path,
-                base_model=model_args.model_name_or_path,
-                save_tokenizer=True,
-            )
+            # ensure checkpoint dir has correct files, important with multi-gpu tuning
+            if os.path.exists(
+                os.path.join(full_checkpoint_dir, "adapter_config.json")
+            ):
+                create_merged_model(
+                    checkpoint_models=full_checkpoint_dir,
+                    export_path=export_path,
+                    base_model=model_args.model_name_or_path,
+                    save_tokenizer=True,
+                )
     except Exception as e:  # pylint: disable=broad-except
         logging.error(traceback.format_exc())
         write_termination_log(
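The fix guards the merge on the presence of `adapter_config.json`, since in multi-GPU LoRA runs a checkpoint directory may not contain adapter files. A minimal sketch of that guard, using a hypothetical `should_merge` helper (not part of the repository) to show the same `os.path.exists` check in isolation:

```python
import os
import tempfile

def should_merge(checkpoint_dir: str) -> bool:
    """Return True only when the checkpoint dir holds a LoRA adapter config.

    Mirrors the guard added in this commit: merging is skipped for
    checkpoint dirs that lack adapter files.
    """
    return os.path.exists(os.path.join(checkpoint_dir, "adapter_config.json"))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(should_merge(d))  # directory has no adapter config yet
        open(os.path.join(d, "adapter_config.json"), "w").close()
        print(should_merge(d))  # adapter config now present
```

With this shape, a full-model checkpoint (no adapter files) simply skips `create_merged_model` instead of failing or merging against missing adapter weights.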
