
Llama3-8B finetuning fails with runtime error TDRV:v2_cc_execute #658

@jianyinglangaws

Description


System Info

The same script works with `Neuron SDK 2.18.0` and `optimum-neuron v0.0.22`, but the latest software stack

```
(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ yum list | grep neuron
aws-neuronx-collectives.x86_64                                    2.21.46.0_69b77134b-1                       @neuron
aws-neuronx-dkms.noarch                                           2.17.17.0-dkms                              @neuron
aws-neuronx-runtime-lib.x86_64                                    2.21.41.0_fb1705f5f-1                       @neuron
aws-neuronx-tools.x86_64                                          2.18.3.0-1                                  @neuron
aws-neuron-dkms.noarch                                            2.3.26.0-dkms                               neuron
aws-neuron-dkms.src                                               2.3.26.0-dkms                               neuron
aws-neuron-k8-plugin.x86_64                                       1.9.3.0-1                                   neuron
aws-neuron-k8-scheduler.x86_64                                    1.9.3.0-1                                   neuron
aws-neuron-runtime.x86_64                                         1.6.24.0-1                                  neuron
aws-neuron-runtime-base.x86_64                                    1.6.21.0-1                                  neuron
aws-neuron-tools.x86_64                                           2.1.4.0-1                                   neuron
aws-neuronx-dkms.src                                              2.17.17.0-dkms                              neuron
aws-neuronx-gpsimd-customop.x86_64                                0.2.3.0-1                                   neuron
aws-neuronx-gpsimd-customop-lib.x86_64                            0.11.4.0-1                                  neuron
aws-neuronx-gpsimd-tools.x86_64                                   0.11.3.0_36dcb86d4-1                        neuron
aws-neuronx-k8-plugin.x86_64                                      2.21.14.0-1                                 neuron
aws-neuronx-k8-scheduler.x86_64                                   2.21.14.0-1                                 neuron
aws-neuronx-oci-hook.x86_64                                       2.4.4.0-1                                   neuron
tensorflow-model-server-neuron.x86_64                             2.8.0.2.3.0.0-0                             neuron
tensorflow-model-server-neuronx.x86_64                            2.10.1.2.11.4.0-0                           neuron
(aws_neuron_venv_pytorch) [ec2-user@ip-172-31-29-22 text-generation]$ pip list | grep neuron
aws-neuronx-runtime-discovery 2.9
libneuronxla                  2.0.2335
neuronx-cc                    2.13.66.0+6dfecc895
neuronx-distributed           0.7.0
optimum-neuron                0.0.23
torch-neuronx                 2.1.2.2.1.0
transformers-neuronx          0.10.0.21
```

gives the following error.

```
745142719040221994+6bd63055/model.neff. Exiting with a successfully compiled graph.
2024-Jul-17 22:00:32 ERROR TDRV:v2_cc_execute [nec_dev 1, gid 1] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR TDRV:v2_cc_execute [nec_dev 3, gid 3] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR TDRV:v2_cc_execute [nec_dev 4, gid 4] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR TDRV:v2_cc_execute [nec_dev 5, gid 5] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR TDRV:v2_cc_execute [nec_dev 6, gid 6] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff
2024-Jul-17 22:00:32 ERROR TDRV:v2_cc_execute [nec_dev 7, gid 7] MPMD execution is not supported. This is likely caused for some but not all ranks recompiling/reloading a graph, model: /tmp/ec2-user/neuroncc_compile_workdir/59a5b5cd-fff2-4315-a603-8a152f5186ca/model.MODULE_12429740934125521760+6bd63055.neff
2024-Jul-17 22:00:32 ERROR ENC:enc_dump_neff_info [nec_dev 1, gid 1] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
2024-Jul-17 22:00:32 ERROR ENC:enc_dump_neff_info [nec_dev 3, gid 3] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/43f12c55-64e6-45ad-a002-a0676ed72df9/model.MODULE_5948348338269179475+6bd63055.neff
2024-Jul-17 22:00:32 ERROR ENC:enc_dump_neff_info [nec_dev 4, gid 4] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
2024-Jul-17 22:00:32 ERROR ENC:enc_dump_neff_info [nec_dev 5, gid 5] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/a12cbe53-3042-4c9a-86d4-4bbcac795471/model.MODULE_15938951179947649509+6bd63055.neff
2024-Jul-17 22:00:32 ERROR ENC:enc_dump_neff_info [nec_dev 6, gid 6] NEFF: /tmp/ec2-user/neuroncc_compile_workdir/eac078e9-79e3-49af-8695-c65797ca89c0/model.MODULE_3281029225498615900+6bd63055.neff
```
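The notable detail in the log is that every rank reports a different `MODULE_<hash>` NEFF, which matches the runtime's complaint that some but not all ranks recompiled a graph. As an illustration only (this helper is hypothetical, not part of any Neuron tool), the per-rank hashes can be pulled out of the log text like this:

```python
import re

# Two of the log lines from above, abridged to one rank each.
log = """
[nec_dev 1, gid 1] MPMD execution is not supported ... model: /tmp/ec2-user/neuroncc_compile_workdir/952ba576-b08c-4ac0-b0f1-12e560e5e362/model.MODULE_11203961494150985019+6bd63055.neff
[nec_dev 4, gid 4] MPMD execution is not supported ... model: /tmp/ec2-user/neuroncc_compile_workdir/ebe15e9e-3f50-4061-8ef1-9c80d9a78071/model.MODULE_12660838522657173708+6bd63055.neff
"""

# Map each rank (nec_dev id) to the graph hash it tried to execute.
pattern = re.compile(r"\[nec_dev (\d+).*?MODULE_(\d+)")
hashes = {dev: mod for dev, mod in pattern.findall(log)}

# SPMD execution requires every rank to load the same graph; here they differ.
print(hashes)
print(len(set(hashes.values())) > 1)  # True -> ranks diverged
```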


### Who can help?

_No response_

### Information

- [X] The official example scripts
- [ ] My own modified scripts

### Tasks

- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction (minimal, reproducible, runnable)

The steps I used are below.

Launch an instance with Amazon Linux 2023 and install the dependencies using the following script:

```
# Configure Linux for Neuron repository updates
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<EOF
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Update OS packages
sudo yum update -y

# Install OS headers
sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r) -y

# Install git
sudo yum install git -y

# Install Neuron Driver
sudo yum install aws-neuronx-dkms-2.* -y

# Install Neuron Runtime
sudo yum install aws-neuronx-collectives-2.* -y
sudo yum install aws-neuronx-runtime-lib-2.* -y

# Install Neuron Tools
sudo yum install aws-neuronx-tools-2.* -y

# Create python3 venv
sudo yum install -y libxcrypt-compat
sudo yum install -y gcc-c++
python3 -m venv /home/ec2-user/aws_neuron_venv_pytorch

# Activate venv
source ~/aws_neuron_venv_pytorch/bin/activate

python -m pip install -U pip

# Install Jupyter notebook kernel
pip install ipykernel
python3 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels

# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

# Install wget, awscli
python -m pip install wget
python -m pip install awscli

# Install Neuron Compiler and Framework
python -m pip install neuronx-cc==2.* torch-neuronx torchvision

# Install optimum-neuron
pip3 install --upgrade-strategy eager "optimum[neuronx]"

# Download scripts
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron/notebooks/text-generation/

# Log in with your Hugging Face token to download gated models
huggingface-cli login --token YOUR_TOKEN
```
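To make version drift easy to spot when reproducing, the installed packages can be diffed against the versions from the `pip list` output above. This is just an illustrative helper, not an official tool; on a live machine the `installed` dict can be built from `importlib.metadata.distributions()`:

```python
# Versions from the failing environment (see `pip list | grep neuron` above).
failing_env = {
    "neuronx-cc": "2.13.66.0+6dfecc895",
    "neuronx-distributed": "0.7.0",
    "optimum-neuron": "0.0.23",
    "torch-neuronx": "2.1.2.2.1.0",
}

def diff_versions(pinned, installed):
    """Return {package: (pinned, installed)} for every mismatched or missing package."""
    return {
        name: (want, installed.get(name))
        for name, want in pinned.items()
        if installed.get(name) != want
    }

# Usage on a live machine (not executed here):
# import importlib.metadata
# installed = {d.metadata["Name"]: d.version for d in importlib.metadata.distributions()}
# print(diff_versions(failing_env, installed))
```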

Create a python3 file `download_data.py` to download and process the dataset under the directory `optimum-neuron/notebooks/text-generation/`:

```python
from datasets import load_dataset
from random import randrange, randint

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

print(format_dolly(dataset[randrange(len(dataset))]))

from transformers import AutoTokenizer

# Hugging Face model id
model_id = "meta-llama/Meta-Llama-3-8B"  # gated
# model_id = "meta-llama/Llama-2-7b-hf"  # gated

tokenizer = AutoTokenizer.from_pretrained(model_id)

# add utils method to path for loading dataset
import sys
sys.path.append("./scripts/utils")  # make sure you change this to the correct path
from pack_dataset import pack_dataset

# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample

# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))

# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])

# tokenize dataset
dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
)

# chunk dataset
lm_dataset = pack_dataset(dataset, chunk_length=2048)  # We use 2048 as the maximum length for packing

# save train_dataset to disk
dataset_path = "tokenized_dolly"
lm_dataset.save_to_disk(dataset_path)
```
Run the above script:

```
python download_data.py
```
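For readers without the repo checked out: `pack_dataset` lives in the repo's `scripts/utils`, and what the packing step does is roughly the following simplified sketch (not the actual implementation, which also handles attention masks and labels): concatenate all tokenized samples and re-split them into fixed-length chunks, dropping the remainder.

```python
def pack_examples(tokenized_samples, chunk_length=2048):
    """Concatenate lists of token ids and re-split into fixed-size chunks.

    Simplified sketch of a packing utility; the trailing tokens that do not
    fill a whole chunk are dropped.
    """
    all_ids = [tok for sample in tokenized_samples for tok in sample]
    total = (len(all_ids) // chunk_length) * chunk_length  # drop the tail
    return [all_ids[i:i + chunk_length] for i in range(0, total, chunk_length)]

# Toy usage with chunk_length=4:
chunks = pack_examples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_length=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Packing this way keeps every training sequence exactly `chunk_length` tokens long, which avoids shape changes between steps.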

Compile the finetuning script on inf2.8xlarge with the `compile_llama3.sh` script:

```
MALLOC_ARENA_MAX=64 neuron_parallel_compile torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --max_steps 10 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
```

Run the finetuning on inf2.8xlarge with the `run_llama3.sh` script:

```
MALLOC_ARENA_MAX=64 torchrun --nproc_per_node=8 scripts/run_clm.py \
  --model_id "meta-llama/Meta-Llama-3-8B" \
  --dataset_path "tokenized_dolly" \
  --bf16 True \
  --learning_rate 5e-5 \
  --output_dir dolly_llama \
  --overwrite_output_dir True \
  --skip_cache_push True \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing True \
  --tensor_parallel_size 8 \
  --num_train_epochs 3 \
  --logging_steps 10 \
  --gradient_accumulation_steps 16
```


### Expected behavior

The run command should complete finetuning and report training performance numbers.
