Can't compile Llama-3-8B or Llama-3.1-8B with LoRA if batch size is more than 1 #709

@anilozlu

Description

System Info

trn1.2xlarge instance on AWS EC2
optimum-neuron version 0.0.25.dev0
transformers version 4.43.2
Amazon Linux 2023 AMI with python=3.9, and Hugging Face Ubuntu 22.04 AMI with python=3.10 (issue occurs on both)

Who can help?

@michaelbenayoun

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

I am trying to fine-tune Llama-3-8B on a single trn1.2xlarge instance, following the tutorial here: https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm but changing the PROCESSES_PER_NODE and TP_DEGREE variables. My compilation script looks like this:

#!/bin/bash
set -ex

# Neuron runtime and compiler settings
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ec2-user/cache_dir_neuron/"

# trn1.2xlarge has a single Trainium chip with 2 NeuronCores,
# so 2 workers with tensor parallelism across both cores
PROCESSES_PER_NODE=2

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=2                            # per-device batch size; the failure below disappears with BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

MAX_STEPS=25
XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE train.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
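
For context, the effective global batch size these settings imply, assuming the data-parallel degree is PROCESSES_PER_NODE divided by TP_DEGREE × PP_DEGREE:

# DP degree = PROCESSES_PER_NODE / (TP_DEGREE * PP_DEGREE) = 2 / (2 * 1) = 1
# effective batch = BS * GRADIENT_ACCUMULATION_STEPS * DP = 2 * 8 * 1 = 16 sequences per optimizer step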

However, during compilation of some of the graphs I get this error:

2024-10-02 17:35:11.000783:  103330  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/22de144e-d107-4885-bf01-4abe86f47a37/model.MODULE_10406581693136771780+6d1be540.neff', '--model-type=transformer', '--distribution-strategy=llm-training', '--enable-saturate-infinity', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2024-10-02T17:35:11Z [TEN404] Internal tensorizer error: TritiumFusion:Should be able to fuse two loops! - Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
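
As the error message itself suggests, more information can be obtained by enabling the XLA debug environment variables; a sketch of that re-run (all training arguments identical to the script above):

# Re-run compilation with extra IR/HLO debug metadata, as the error message suggests
XLA_IR_DEBUG=1 XLA_HLO_DEBUG=1 XLA_USE_BF16=1 neuron_parallel_compile \
  torchrun --nproc_per_node $PROCESSES_PER_NODE train.py ... # same arguments as above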

I can compile and complete training without error if I set the batch size to 1; however, I would like to be able to increase the batch size to speed up training.
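
For completeness, this is the only change needed for a successful run; the gradient-accumulation bump is just an illustrative way to keep the same effective batch size, not part of the reproduction:

BS=1                            # compiles and trains without error
GRADIENT_ACCUMULATION_STEPS=16  # illustrative only: keeps the effective batch size at 16
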
I also get these warnings, which may be relevant:

torch.distributed process group is initialized, but parallel_mode != ParallelMode.DISTRIBUTED. In order to use Torch DDP, launch your script with `python -m torch.distributed.launch
[2024-10-05 19:58:00.706: W neuronx_distributed/parallel_layers/parallel_state.py:439] [rank_0_pp-1_tp-1_dp-1] Failed to initialize NKI parallel state with exception intra_layer_model parallel group is not initialized.Proceeding without distributed NKI support.

Expected behavior

I expect the model to compile and the training script to run without error.

Metadata

Labels

Stale, bug (Something isn't working)
