Trouble when merging datasets. The whole program was stuck with all gpus 100% utils #2234
Unanswered
CopyNinja1999 asked this question in Q&A
Replies: 1 comment
Hello, could you let me know which command you used to run this? Secondly, how large are your datasets? Could you try running preprocess first (make sure to set …
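The preprocess-first workflow being suggested would look roughly like the sketch below. The `dataset_prepared_path` key and the `preprocess` entry point are standard axolotl; the config filename and cache directory name here are placeholders, not taken from the original post:

```yaml
# In the axolotl YAML config: cache the tokenized/merged datasets on disk so
# the expensive processing happens once, up front, instead of inside the
# multi-GPU training launch.
dataset_prepared_path: last_run_prepared

# Then, from a shell, run the preprocess step before launching training:
#   python -m axolotl.cli.preprocess your_config.yml
```

With the datasets prepared ahead of time, the training run only has to load the cached arrow files, which narrows down whether the hang is in dataset processing or in distributed initialization.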
-
The log looked like this:
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/root/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
/root/axolotl/src/axolotl/monkeypatch/relora.py:16: DeprecationWarning: TorchScript support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import ZeroRedundancyOptimizer
[2025-01-03 16:18:24,359] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1987] [RANK:1] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,359] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1987] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,360] [DEBUG] [axolotl.normalize_config:80] [PID:1987] [RANK:1] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_config:183] [PID:1987] [RANK:1] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1987] [RANK:1] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1987] [RANK:1] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,399] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1987] [RANK:1] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,405] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,416] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1989] [RANK:3] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,416] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1989] [RANK:3] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,417] [DEBUG] [axolotl.normalize_config:80] [PID:1989] [RANK:3] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_config:183] [PID:1989] [RANK:3] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1989] [RANK:3] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1989] [RANK:3] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,420] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1989] [RANK:3] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,424] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,447] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1987] [RANK:1] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:24,472] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1989] [RANK:3] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:24,493] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1988] [RANK:2] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,494] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1988] [RANK:2] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,495] [DEBUG] [axolotl.normalize_config:80] [PID:1988] [RANK:2] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,523] [INFO] [axolotl.normalize_config:183] [PID:1988] [RANK:2] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,523] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1988] [RANK:2] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,523] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1988] [RANK:2] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,524] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1988] [RANK:2] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,529] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,577] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1988] [RANK:2] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:24,827] [INFO] [axolotl.utils.config.models.input.check_eval_packing:973] [PID:1986] [RANK:0] setting `remove_unused_columns: false` for when sample_packing and eval_sample_packing don't match
[2025-01-03 16:18:24,827] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:327] [PID:1986] [RANK:0] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
[2025-01-03 16:18:24,828] [DEBUG] [axolotl.normalize_config:80] [PID:1986] [RANK:0] bf16 support detected, enabling for this configuration.
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_config:183] [PID:1986] [RANK:0] GPU memory usage baseline: 0.000GB (+0.460GB misc)
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1986] [RANK:0] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_pretrain.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1986] [RANK:0] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_public.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,887] [INFO] [axolotl.normalize_cfg_datasets:203] [PID:1986] [RANK:0] updating dataset /root/autodl-tmp/gnlu/data/axolotl_training/mbc_simple_v1/mbc_simple_v1_context_free.jsonl with `chat_template: chatml` to match your chat_template
[2025-01-03 16:18:24,893] [INFO] [comm.py:637:init_distributed] cdb=None
[2025-01-03 16:18:24,893] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[axolotl ASCII art banner]
**** Axolotl Dependency Versions ****
accelerate: 0.32.0
peft: 0.11.1
transformers: 4.44.0.dev0
trl: 0.9.6
torch: 2.4.0+cu118
bitsandbytes: 0.43.3
[2025-01-03 16:18:24,950] [WARNING] [axolotl.scripts.check_user_token:489] [PID:1986] [RANK:0] Error verifying HuggingFace token. Remember to log in using `huggingface-cli login` and get your access token from https://huggingface.co/settings/tokens if you want to use gated models or datasets.
[2025-01-03 16:18:25,369] [DEBUG] [axolotl.load_tokenizer:282] [PID:1987] [RANK:1] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:25,370] [DEBUG] [axolotl.load_tokenizer:283] [PID:1987] [RANK:1] BOS: None / None
[2025-01-03 16:18:25,370] [DEBUG] [axolotl.load_tokenizer:284] [PID:1987] [RANK:1] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:25,370] [DEBUG] [axolotl.load_tokenizer:285] [PID:1987] [RANK:1] UNK: None / None
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:282] [PID:1986] [RANK:0] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:283] [PID:1986] [RANK:0] BOS: None / None
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:284] [PID:1986] [RANK:0] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:26,910] [DEBUG] [axolotl.load_tokenizer:285] [PID:1986] [RANK:0] UNK: None / None
[2025-01-03 16:18:26,910] [INFO] [axolotl.load_tokenized_prepared_datasets:184] [PID:1986] [RANK:0] Unable to find prepared dataset in last_run_prepared/ad947a7fb0f6d5ef41bbea482d26499b
[2025-01-03 16:18:26,910] [INFO] [axolotl.load_tokenized_prepared_datasets:185] [PID:1986] [RANK:0] Loading raw datasets...
[2025-01-03 16:18:26,910] [WARNING] [axolotl.load_tokenized_prepared_datasets:187] [PID:1986] [RANK:0] Processing datasets during training can lead to VRAM instability. Please pre-process your dataset.
[2025-01-03 16:18:26,910] [INFO] [axolotl.load_tokenized_prepared_datasets:194] [PID:1986] [RANK:0] No seed provided, using default seed of 42
[2025-01-03 16:18:26,928] [DEBUG] [axolotl.load_tokenizer:282] [PID:1989] [RANK:3] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:26,929] [DEBUG] [axolotl.load_tokenizer:283] [PID:1989] [RANK:3] BOS: None / None
[2025-01-03 16:18:26,929] [DEBUG] [axolotl.load_tokenizer:284] [PID:1989] [RANK:3] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:26,929] [DEBUG] [axolotl.load_tokenizer:285] [PID:1989] [RANK:3] UNK: None / None
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:282] [PID:1988] [RANK:2] EOS: 151645 / <|im_end|>
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:283] [PID:1988] [RANK:2] BOS: None / None
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:284] [PID:1988] [RANK:2] PAD: 151643 / <|endoftext|>
[2025-01-03 16:18:26,930] [DEBUG] [axolotl.load_tokenizer:285] [PID:1988] [RANK:2] UNK: None / None
[2025-01-03 16:18:27,714] [INFO] [axolotl.get_dataset_wrapper:545] [PID:1986] [RANK:0] Loading dataset with base_type: chat_template and prompt_style: None
[2025-01-03 16:18:28,767] [INFO] [axolotl.get_dataset_wrapper:545] [PID:1986] [RANK:0] Loading dataset with base_type: chat_template and prompt_style: None
[2025-01-03 16:18:29,837] [INFO] [axolotl.get_dataset_wrapper:545] [PID:1986] [RANK:0] Loading dataset with base_type: chat_template and prompt_style: None
[2025-01-03 16:18:30,093] [INFO] [axolotl.load_tokenized_prepared_datasets:419] [PID:1986] [RANK:0] merging datasets
The program has been stuck here indefinitely. I wonder what the reason could be.
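For what it's worth, 100% utilization on every GPU does not necessarily mean every rank is doing useful work: with the NCCL backend, ranks blocked on a collective typically busy-poll, so ranks 1-3 can show 100% in nvidia-smi while only rank 0 is tokenizing and merging the datasets. A minimal stdlib-only sketch of that rank-0-prepares-then-synchronize pattern (the names and numbers below are illustrative, not axolotl's actual code):

```python
import threading

def worker(rank: int, barrier: threading.Barrier, shared: dict) -> None:
    """Illustrative stand-in for one training rank."""
    if rank == 0:
        # Only rank 0 does the expensive tokenize/merge work. In a real NCCL
        # job, the other ranks would busy-wait on the GPU at this point,
        # showing 100% utilization while making no progress.
        shared["prepared"] = sum(range(1000))  # stand-in for the merge step
    barrier.wait()  # all ranks synchronize once rank 0 has finished
    shared[rank] = shared["prepared"]  # every rank now reads the cached result

def run(world_size: int = 4) -> dict:
    shared: dict = {}
    barrier = threading.Barrier(world_size)
    threads = [
        threading.Thread(target=worker, args=(r, barrier, shared))
        for r in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

If the merge on rank 0 is slow (or stuck), every other rank sits at the synchronization point, which matches a log that ends at "merging datasets" with all GPUs pegged.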