
Transformers and deepspeed dependency problem #35

Open
ks1212-rgb opened this issue Jan 3, 2025 · 3 comments
@ks1212-rgb

First of all, thank you for your work.

We would like to utilize this excellent model in our research. Specifically, we plan to use run_pretrain.py and run_pretrain.bash to pretrain the model and then extract sequence embeddings from the protein encoder.
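As background for the embedding-extraction step: for a ProtBERT-style encoder, a per-sequence embedding is commonly obtained by mean-pooling the last hidden states over non-padding tokens. A minimal sketch, assuming a Rostlab/prot_bert-style checkpoint and mean pooling (both are assumptions, not OntoProtein's official API):

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask == 1 (plain-Python sketch).

    hidden_states: list of token vectors (list of floats)
    attention_mask: list of 0/1 flags, same length as hidden_states
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / count for t in total]

# With transformers installed (hypothetical checkpoint path; ProtBERT
# tokenizers expect space-separated residues):
# from transformers import BertModel, BertTokenizer
# tok = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
# model = BertModel.from_pretrained("Rostlab/prot_bert")
# inputs = tok(" ".join("MKTAYIAK"), return_tensors="pt")
# hidden = model(**inputs).last_hidden_state[0].tolist()
# embedding = mean_pool(hidden, inputs["attention_mask"][0].tolist())
```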

However, we are currently blocked by version incompatibilities between transformers and deepspeed.
We followed a solution suggested in an existing issue, which advised against using the deepspeed.py provided by OntoProtein and in favor of the user's own deepspeed installation.
Even so, problems tied to the transformers version persist: for example, errors arise with transformers.deepspeed and with get_scheduler in src.optimization.

Could you provide the environment file used during model development or share the specific versions of transformers and deepspeed that were used?

Thank you.

@Alexzhuan
Collaborator

Hi,

Thanks for your interest in our work.

We have listed the recommended versions of transformers==4.5.1 and deepspeed==0.5.1 in the README.

@zxlzr
Contributor

zxlzr commented Jan 3, 2025

Hi, do you have any further questions?

@ks1212-rgb
Author

Hi,

Thank you for your previous guidance.
We installed transformers==4.5.1 and deepspeed==0.5.1 as recommended, but the issue persists. Below is the error message:

[2025-01-04 17:14:49,577] [WARNING] [runner.py:122:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-01-04 17:14:49,652] [INFO] [runner.py:360:main] cmd = /home/tech/miniconda3/envs/new_OntoProtein/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=29500 ../run_pretrain.py --do_train --output_dir ../data/output_data/filtered_ke_text --pretrain_data_dir ../data/pretrain_data --protein_seq_data_file_name swiss_seq --in_memory true --max_protein_seq_length 1024 --model_protein_seq_data true --model_protein_go_data true --model_go_go_data true --use_desc true --max_text_seq_length 128 --dataloader_protein_go_num_workers 1 --dataloader_go_go_num_workers 1 --dataloader_protein_seq_num_workers 1 --num_protein_go_neg_sample 128 --num_go_go_neg_sample 128 --negative_sampling_fn simple_random --protein_go_sample_head false --protein_go_sample_tail true --go_go_sample_head true --go_go_sample_tail true --protein_model_file_name ../data/model_data/ProtBERT --text_model_file_name ../data/model_data/OntoModel --go_encoder_cls bert --protein_encoder_cls bert --ke_embedding_size 512 --double_entity_embedding_size false --max_steps 60000 --per_device_train_batch_size 4 --weight_decay 0.01 --optimize_memory true --gradient_accumulation_steps 256 --lr_scheduler_type linear --mlm_lambda 1.0 --lm_learning_rate 1e-5 --lm_warmup_steps 50000 --ke_warmup_steps 50000 --ke_lambda 1.0 --ke_learning_rate 2e-5 --ke_max_score 12.0 --ke_score_fn transE --ke_warmup_ratio --seed 2021 --deepspeed dp_config.json --fp16 --dataloader_pin_memory
[2025-01-04 17:14:50,197] [INFO] [launch.py:80:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2025-01-04 17:14:50,197] [INFO] [launch.py:86:main] nnodes=1, num_local_procs=4, node_rank=0
[2025-01-04 17:14:50,197] [INFO] [launch.py:101:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2025-01-04 17:14:50,197] [INFO] [launch.py:102:main] dist_world_size=4
[2025-01-04 17:14:50,197] [INFO] [launch.py:104:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
Traceback (most recent call last):
  File "../run_pretrain.py", line 6, in <module>
    from src.models import OntoProteinPreTrainedModel
  File "/nas/autism/features/OntoProtein/OntoProtein/src/models.py", line 16, in <module>
    from transformers.deepspeed import is_deepspeed_zero3_enabled
ModuleNotFoundError: No module named 'transformers.deepspeed'
[the identical traceback is printed by each of the other three ranks]
Killing subprocess 1311078
Killing subprocess 1311079
Killing subprocess 1311080
Killing subprocess 1311081
Traceback (most recent call last):
  File "/home/tech/miniconda3/envs/new_OntoProtein/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/tech/miniconda3/envs/new_OntoProtein/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/tech/miniconda3/envs/new_OntoProtein/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 171, in <module>
    main()
  File "/home/tech/miniconda3/envs/new_OntoProtein/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 161, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/tech/miniconda3/envs/new_OntoProtein/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 139, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/tech/miniconda3/envs/new_OntoProtein/bin/python', '-u', '../run_pretrain.py', '--local_rank=3', '--do_train', '--output_dir', '../data/output_data/filtered_ke_text', '--pretrain_data_dir', '../data/pretrain_data', '--protein_seq_data_file_name', 'swiss_seq', '--in_memory', 'true', '--max_protein_seq_length', '1024', '--model_protein_seq_data', 'true', '--model_protein_go_data', 'true', '--model_go_go_data', 'true', '--use_desc', 'true', '--max_text_seq_length', '128', '--dataloader_protein_go_num_workers', '1', '--dataloader_go_go_num_workers', '1', '--dataloader_protein_seq_num_workers', '1', '--num_protein_go_neg_sample', '128', '--num_go_go_neg_sample', '128', '--negative_sampling_fn', 'simple_random', '--protein_go_sample_head', 'false', '--protein_go_sample_tail', 'true', '--go_go_sample_head', 'true', '--go_go_sample_tail', 'true', '--protein_model_file_name', '../data/model_data/ProtBERT', '--text_model_file_name', '../data/model_data/OntoModel', '--go_encoder_cls', 'bert', '--protein_encoder_cls', 'bert', '--ke_embedding_size', '512', '--double_entity_embedding_size', 'false', '--max_steps', '60000', '--per_device_train_batch_size', '4', '--weight_decay', '0.01', '--optimize_memory', 'true', '--gradient_accumulation_steps', '256', '--lr_scheduler_type', 'linear', '--mlm_lambda', '1.0', '--lm_learning_rate', '1e-5', '--lm_warmup_steps', '50000', '--ke_warmup_steps', '50000', '--ke_lambda', '1.0', '--ke_learning_rate', '2e-5', '--ke_max_score', '12.0', '--ke_score_fn', 'transE', '--ke_warmup_ratio', '--seed', '2021', '--deepspeed', 'dp_config.json', '--fp16', '--dataloader_pin_memory']' returned non-zero exit status 1.

To address this, we tried upgrading to the latest versions of transformers and deepspeed, and using transformers.integrations.deepspeed. However, this requires significant changes to the official codebase, including files like trainer.py and run_pretrain.py.

Is there a way to resolve this issue while keeping the official code intact?

Thank you!
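One way to keep the official code untouched is to register the relocated module under its old dotted name in sys.modules before run_pretrain.py is imported, since Python consults sys.modules before searching disk. A minimal sketch, assuming a recent transformers release where the module lives at transformers.integrations.deepspeed (the exact version in which it moved is not confirmed here):

```python
import importlib
import sys

def alias_module(old_name: str, new_name: str) -> None:
    """Make `import old_name` resolve to the module located at new_name.

    sys.modules is checked before the import machinery searches the
    filesystem, so registering the new module under the legacy dotted
    name keeps old import statements working without editing them.
    """
    if old_name not in sys.modules:
        sys.modules[old_name] = importlib.import_module(new_name)

# Hypothetical usage for this issue (assumes a transformers version where
# transformers.deepspeed moved to transformers.integrations.deepspeed);
# run this in a small wrapper script before importing run_pretrain:
# alias_module("transformers.deepspeed", "transformers.integrations.deepspeed")
```

This keeps trainer.py and run_pretrain.py intact, at the cost of a small launcher wrapper; whether every symbol the old module exposed (e.g. is_deepspeed_zero3_enabled) survives at the new location would still need to be checked against the installed transformers version.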
