Trouble when merging datasets. The whole program was stuck with all gpus 100% utils #2234
Unanswered
CopyNinja1999 asked this question in Q&A
Replies: 1 comment
Hello, could you let me know which command you used to run this? Secondly, how large are your datasets? Could you try running preprocess first (make sure to set …
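The preprocess-first workflow being suggested would look roughly like the sketch below. The `dataset_prepared_path` key and the `preprocess` entry point are standard axolotl; the config filename and cache directory name here are placeholders, not taken from the original post:

```yaml
# In the axolotl YAML config: cache the tokenized/merged datasets on disk so
# the expensive processing happens once, up front, instead of inside the
# multi-GPU training launch.
dataset_prepared_path: last_run_prepared

# Then, from a shell, run the preprocess step before launching training:
#   python -m axolotl.cli.preprocess your_config.yml
```

With the datasets prepared ahead of time, the training run only has to load the cached arrow files, which narrows down whether the hang is in dataset processing or in distributed initialization.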
-
The log looked like this:
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
[2025-01-03 16:18:24,359] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1987] [RANK:1] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,359] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1987] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,360] [DEBUG] [axolotl.normalize_config:80] [PID:1987] [RANK:1] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_config:183] [PID:1987] [RANK:1] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1987] [RANK:1] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1987] [RANK:1] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1987] [RANK:1] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,405] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,416] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1989] [RANK:3] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,416] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1989] [RANK:3] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,417] [DEBUG] [axolotl.normalize_config:80] [PID:1989] [RANK:3] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_config:183] [PID:1989] [RANK:3] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1989] [RANK:3] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1989] [RANK:3] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1989] [RANK:3] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,424] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,447] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1987] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:24,472] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1989] [RANK:3] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:24,493] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1988] [RANK:2] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,494] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1988] [RANK:2] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,495] [DEBUG] [axolotl.normalize_config:80] [PID:1988] [RANK:2] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,523] [INFO] [axolotl.normalize_config:183] [PID:1988] [RANK:2] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,523] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1988] [RANK:2] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,523] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1988] [RANK:2] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,524] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1988] [RANK:2] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,529] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,577] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1988] [RANK:2] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:24,827] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1986] [RANK:0] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,827] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1986] [RANK:0] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,828] [DEBUG] [axolotl.normalize_config:80] [PID:1986] [RANK:0] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_config:183] [PID:1986] [RANK:0] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1986] [RANK:0] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1986] [RANK:0] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1986] [RANK:0] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,893] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,893] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[axolotl ASCII art banner]
**** Axolotl Dependency Versions ****
accelerate: 0.32.0
peft: 0.11.1
transformers: 4.44.0.dev0
trl: 0.9.6
torch: 2.4.0+cu118
bitsandbytes: 0.43.3
[2025-01-03 16:18:24,950] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1986] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:25,369] [DEBUG] [axolotl.load_tokenizer:282] [PID:1987] [RANK:1] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:25,370] [DEBUG] [axolotl.load_tokenizer:283] [PID:1987] [RANK:1] BOS: None / None
[2025-01-03 16:18:25,370] [DEBUG] [axolotl.load_tokenizer:284] [PID:1987] [RANK:1] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:25,370] [DEBUG] [axolotl.load_tokenizer:285] [PID:1987] [RANK:1] UNK: None / None
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:282] [PID:1986] [RANK:0] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:283] [PID:1986] [RANK:0] BOS: None / None
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:284] [PID:1986] [RANK:0] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:285] [PID:1986] [RANK:0] UNK: None / None
[2025-01-03 16:18:26,910] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:1986] [RANK:0] Unable to find prepared dataset in last_run_prepared/ad947a7fb0f6d5ef41bbea482d26499b
[2025-01-03 16:18:26,910] [INFO] [axolotl.load_tokenized_prepared_datasets:185] [PID:1986] [RANK:0] Loading raw datasets...
[2025-01-03 16:18:26,910] [WARNING] [axolotl.load_tokenized_prepared_datasets:187] [PID:1986] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2025-01-03 16:18:26,910] [INFO] [axolotl.load_tokenized_prepared_datasets:194] [PID:1986] [RANK:0] No seed provided, using default seed of 42
[2025-01-03 16:18:26,928] [DEBUG] [axolotl.load_tokenizer:282] [PID:1989] [RANK:3] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:26,929] [DEBUG] [axolotl.load_tokenizer:283] [PID:1989] [RANK:3] BOS: None / None
[2025-01-03 16:18:26,929] [DEBUG] [axolotl.load_tokenizer:284] [PID:1989] [RANK:3] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:26,929] [DEBUG] [axolotl.load_tokenizer:285] [PID:1989] [RANK:3] UNK: None / None
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:282] [PID:1988] [RANK:2] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:283] [PID:1988] [RANK:2] BOS: None / None
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:284] [PID:1988] [RANK:2] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:285] [PID:1988] [RANK:2] UNK: None / None
[2025-01-03 16:18:27,714] [INFO] [axolotl.get_dataset_wrapper:545] [PID:1986] [RANK:0] Loading dataset with base_type: chat_template and prompt_style: None
[2025-01-03 16:18:28,767] [INFO] [axolotl.get_dataset_wrapper:545] [PID:1986] [RANK:0] Loading dataset with base_type: chat_template and prompt_style: None
[2025-01-03 16:18:29,837] [INFO] [axolotl.get_dataset_wrapper:545] [PID:1986] [RANK:0] Loading dataset with base_type: chat_template and prompt_style: None
[2025-01-03 16:18:30,093] [INFO] [axolotl.load_tokenized_prepared_datasets:419] [PID:1986] [RANK:0] merging datasets
The program has been stuck here indefinitely. I wonder what the reason could be.
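For what it's worth, 100% utilization on every GPU does not necessarily mean every rank is doing useful work: with the NCCL backend, ranks blocked on a collective typically busy-poll, so ranks 1-3 can show 100% in nvidia-smi while only rank 0 is tokenizing and merging the datasets. A minimal stdlib-only sketch of that rank-0-prepares-then-synchronize pattern (the names and numbers below are illustrative, not axolotl's actual code):

```python
import threading

def worker(rank: int, barrier: threading.Barrier, shared: dict) -> None:
    """Illustrative stand-in for one training rank."""
    if rank == 0:
        # Only rank 0 does the expensive tokenize/merge work. In a real NCCL
        # job, the other ranks would busy-wait on the GPU at this point,
        # showing 100% utilization while making no progress.
        shared["prepared"] = sum(range(1000))  # stand-in for the merge step
    barrier.wait()  # all ranks synchronize once rank 0 has finished
    shared[rank] = shared["prepared"]  # every rank now reads the cached result

def run(world_size: int = 4) -> dict:
    shared: dict = {}
    barrier = threading.Barrier(world_size)
    threads = [
        threading.Thread(target=worker, args=(r, barrier, shared))
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

If the merge on rank 0 is slow (or stuck), every other rank sits at the synchronization point, which matches a log that ends at "merging datasets" with all GPUs pegged.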