Can't compile Llama-3-8B or Llama-3.1-8B with LoRA if batch size is more than 1 #709

@anilozlu

Description

System Info

trn1.2xlarge instance on AWS EC2
optimum-neuron version 0.0.25.dev0
transformers version 4.43.2
Amazon Linux 2023 AMI with python=3.9, and Hugging Face Ubuntu 22.04 AMI with python=3.10 (issue occurs on both)

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I am trying to fine-tune Llama-3-8B on a single trn1.2xlarge instance, following the tutorial here: https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm but changing the PROCESSES_PER_NODE and TP_DEGREE variables. My compilation script looks like this:

#!/bin/bash
set -ex

# Neuron runtime and compiler settings
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ec2-user/cache_dir_neuron/"

# trn1.2xlarge has a single Trainium chip with 2 NeuronCores,
# so 2 workers with tensor parallelism across both cores
PROCESSES_PER_NODE=2

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=2                            # per-device batch size; the failure below disappears with BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

MAX_STEPS=25
XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
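
For context, the effective global batch size these settings imply, assuming the data-parallel degree is PROCESSES_PER_NODE divided by TP_DEGREE × PP_DEGREE:

# DP degree = PROCESSES_PER_NODE / (TP_DEGREE * PP_DEGREE) = 2 / (2 * 1) = 1
# effective batch = BS * GRADIENT_ACCUMULATION_STEPS * DP = 2 * 8 * 1 = 16 sequences per optimizer step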

However, during compilation of some of the graphs I get this error:

2024-10-02 17:35:11.000783:  103330  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.neff', '--model-type=transformer', '--distribution-strategy=llm-training', '--enable-saturate-infinity', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2024-10-02T17:35:11Z [TEN404] Internal tensorizer error: TritiumFusion:Should be able to fuse two loops! - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
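
As the error message itself suggests, more information can be obtained by enabling the XLA debug environment variables; a sketch of that re-run (all training arguments identical to the script above):

# Re-run compilation with extra IR/HLO debug metadata, as the error message suggests
XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_USE_BF16=1 neuron_parallel_compile \
  torchrun --nproc_per_node $PROCESSES_PER_NODE train.py ... # same arguments as above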

I can compile and complete training without error if I set the batch size to 1; however, I would like to be able to increase the batch size to speed up training.
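
For completeness, this is the only change needed for a successful run; the gradient-accumulation bump is just an illustrative way to keep the same effective batch size, not part of the reproduction:

BS=1                            # compiles and trains without error
GRADIENT_ACCUMULATION_STEPS=16  # illustrative only: keeps the effective batch size at 16
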
I also get these warnings, which may be relevant:

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
[2024-10-05 19:58:00.706: W neuronx_distributed/parallel_layers/parallel_state.py:439] [rank_0_pp-1_tp-1_dp-1] Failed to initialize NKI parallel state with exception intra_layer_model parallel group is not initialized.Proceeding without distributed NKI support.

Expected behavior

I expect the model to compile and the training script to run without error.

Metadata

Labels

Stale, bug (Something isn't working)
