System Info
trn1.2xlarge instance on AWS EC2
optimum-neuron version 0.0.25.dev0
transformers version 4.43.2
Tried both the Amazon Linux 2023 AMI (Python 3.9) and the Hugging Face Ubuntu 22.04 AMI (Python 3.10)
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction (minimal, reproducible, runnable)
I am trying to fine-tune Llama-3-8B on a single trn1.2xlarge instance. I am following the tutorial here: https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm, but changing the PROCESSES_PER_NODE and TP_DEGREE variables. My compilation script looks like this:
#!/bin/bash
set -ex
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ec2-user/cache_dir_neuron/"
PROCESSES_PER_NODE=2
NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=2
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID
MAX_STEPS=25
XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
--model_id $MODEL_NAME \
--num_train_epochs $NUM_EPOCHS \
--do_train \
--learning_rate 5e-5 \
--warmup_ratio 0.03 \
--max_steps $MAX_STEPS \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--gradient_checkpointing true \
--bf16 \
--zero_1 false \
--tensor_parallel_size $TP_DEGREE \
--pipeline_parallel_size $PP_DEGREE \
--logging_steps $LOGGING_STEPS \
--save_total_limit 1 \
--output_dir $OUTPUT_DIR \
--lr_scheduler_type "constant" \
--overwrite_output_dir
However, during compilation of some of the graphs I get this error:
2024-10-02 17:35:11.000783: 103330 ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.neff', '--model-type=transformer', '--distribution-strategy=llm-training', '--enable-saturate-infinity', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2024-10-02T17:35:11Z [TEN404] Internal tensorizer error: TritiumFusion:Should be able to fuse two loops! - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
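As the error message suggests, more detail about the failing graph can be gathered by re-running the same command with the XLA debug variables enabled. This is only a sketch of how I would invoke it; $TRAIN_ARGS is just a placeholder for the full train.py argument list from the script above:
# Sketch: re-run graph extraction/compilation with extra XLA debug metadata,
# as suggested by the compiler error message. $TRAIN_ARGS is a hypothetical
# placeholder for the train.py arguments shown in the script above.
XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_USE_BF16=1 neuron_parallel_compile \
    torchrun --nproc_per_node $PROCESSES_PER_NODE train.py $TRAIN_ARGS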
I can compile and complete training without error if I set the batch size to 1; however, I would like to be able to increase the batch size to speed up training.
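One possible stopgap is to keep the per-device batch size at 1 and raise the gradient accumulation steps so the effective batch size stays the same (effective batch size = per-device batch size x gradient accumulation steps x data-parallel degree), but that does not address the compiler error itself. A sketch of the change to the script above:
# Workaround sketch: keep the effective batch size constant while using the
# per-device batch size of 1 that compiles successfully.
BS=1                             # was 2, which triggers the TEN404 error
GRADIENT_ACCUMULATION_STEPS=16   # was 8; 1 * 16 matches the original 2 * 8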
I also get these warnings, which may be relevant:
torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
[2024-10-05 19:58:00.706: W neuronx_distributed/parallel_layers/parallel_state.py:439] [rank_0_pp-1_tp-1_dp-1] Failed to initialize NKI parallel state with exception intra_layer_model parallel group is not initialized.Proceeding without distributed NKI support.
Expected behavior
I expect the model to compile and the training script to run without error.