Skip to content

Move neuron_parallel_compile outside of bash script #706

@jgray-aws

Description

@jgray-aws

Currently, the tutorial call neuron_parallel_compile inside of the bash script. Because neuron_parallel_compile is responsible for setting $NEURON_EXTRACT_GRAPHS_ONLY, this causes the MAX_STEPS set to -1, causing compilation to run for >1 hour.

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi

```bash
#!/bin/bash
set -ex
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"
PROCESSES_PER_NODE=8
NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID
if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
MAX_STEPS=$((LOGGING_STEPS + 5))
else
MAX_STEPS=-1
fi
XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
--model_id $MODEL_NAME \
--num_train_epochs $NUM_EPOCHS \
--do_train \
--learning_rate 5e-5 \
--warmup_ratio 0.03 \
--max_steps $MAX_STEPS \
--per_device_train_batch_size $BS \
--per_device_eval_batch_size $BS \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--gradient_checkpointing true \
--bf16 \
--zero_1 false \
--tensor_parallel_size $TP_DEGREE \
--pipeline_parallel_size $PP_DEGREE \
--logging_steps $LOGGING_STEPS \
--save_total_limit 1 \
--output_dir $OUTPUT_DIR \
--lr_scheduler_type "constant" \
--overwrite_output_dir
```

We need to refactor the tutorial to call neuron_parallel_compile on the training script.

Example can be found here:

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tutorials/training_llama_tp_zero1.html

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions