diff --git a/examples/image-to-text/README.md b/examples/image-to-text/README.md
index 776bc8330d..128b5cd133 100644
--- a/examples/image-to-text/README.md
+++ b/examples/image-to-text/README.md
@@ -63,50 +63,6 @@ Inference with FP8 precision is enabled using [Intel Neural Compressor (INC)](ht
More information on enabling FP8 in SynapseAI is available here: [Run Inference Using FP8](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html?highlight=fp8)

-### Single card inference with FP8
-Here is an example of measuring the tensor quantization statistics on Llava-v1.6-vicuna-13b with SDPA:
-```bash
-PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
-    --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
-    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
-    --use_hpu_graphs \
-    --bf16 \
-    --sdp_on_bf16
-```
-
-Here is an example of quantizing the model based on previous measurements for Llava-v1.6-vicuna-13b with SDPA:
-```bash
-PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
-    --model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
-    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
-    --use_hpu_graphs \
-    --bf16 \
-    --sdp_on_bf16
-```
-
-### Multi-card inference with FP8
-Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
-```bash
-PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
-    --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
-    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
-    --use_hpu_graphs \
-    --bf16 \
-    --use_flash_attention \
-    --flash_attention_recompute
-```
-
-Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
-```bash
-PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
-    --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
-    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
-    --use_hpu_graphs \
-    --bf16 \
-    --use_flash_attention \
-    --flash_attention_recompute
-```
-
## LoRA Finetune

Here are single-/multi-device command examples for meta-llama/Llama-3.2-11B-Vision-Instruct.

diff --git a/examples/speech-recognition/README.md b/examples/speech-recognition/README.md
index da53e2f226..1937415878 100644
--- a/examples/speech-recognition/README.md
+++ b/examples/speech-recognition/README.md
@@ -276,45 +276,6 @@ PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
If training on a different language, be sure to change the `language` argument. The `language` and `task` arguments should be omitted for English speech recognition.
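For English speech recognition, the same script is invoked without those two arguments. Below is a minimal single-HPU sketch rather than a tested recipe: the `"en"` dataset config, the `openai/whisper-small` checkpoint, and the reduced flag set are illustrative assumptions, with all other flags taken from the examples in this file:

```bash
# Illustrative English fine-tuning run: --language and --task are deliberately
# omitted, and the "en" subset of Common Voice 11 is assumed as the dataset.
PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
 --model_name_or_path="openai/whisper-small" \
 --dataset_name="mozilla-foundation/common_voice_11_0" \
 --trust_remote_code \
 --dataset_config_name="en" \
 --train_split_name="train+validation" \
 --eval_split_name="test" \
 --gaudi_config_name="Habana/whisper" \
 --output_dir="/tmp/whisper-small-en" \
 --per_device_train_batch_size="16" \
 --learning_rate="1e-5" \
 --text_column_name="sentence" \
 --bf16 \
 --overwrite_output_dir \
 --do_train \
 --do_eval \
 --predict_with_generate \
 --use_habana
```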
-### Multi HPU Whisper Training with Seq2Seq
-The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision:
-```bash
-PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
- --world_size 8 --use_mpi run_speech_recognition_seq2seq.py \
- --model_name_or_path="openai/whisper-large" \
- --dataset_name="mozilla-foundation/common_voice_11_0" \
- --trust_remote_code \
- --dataset_config_name="hi" \
- --language="hindi" \
- --task="transcribe" \
- --train_split_name="train+validation" \
- --eval_split_name="test" \
- --gaudi_config_name="Habana/whisper" \
- --max_steps="625" \
- --output_dir="/tmp/whisper-large-hi" \
- --per_device_train_batch_size="16" \
- --per_device_eval_batch_size="2" \
- --logging_steps="25" \
- --learning_rate="1e-5" \
- --generation_max_length="225" \
- --preprocessing_num_workers="1" \
- --max_duration_in_seconds="30" \
- --text_column_name="sentence" \
- --freeze_feature_encoder="False" \
- --sdp_on_bf16 \
- --bf16 \
- --overwrite_output_dir \
- --do_train \
- --do_eval \
- --predict_with_generate \
- --use_habana \
- --use_hpu_graphs_for_inference \
- --label_features_max_length 128 \
- --dataloader_num_workers 8 \
- --gradient_checkpointing \
- --throughput_warmup_steps 3
-```
-
#### Single HPU Seq2Seq Inference

The following example shows how to do inference with the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 1 HPU device in half-precision:

diff --git a/examples/text-generation/README.md b/examples/text-generation/README.md
index e5dec80943..c3530f1c32 100755
--- a/examples/text-generation/README.md
+++ b/examples/text-generation/README.md
@@ -98,24 +98,6 @@ PT_HPU_LAZY_MODE=1 python run_generation.py \
> The batch size should be larger than or equal to the number of prompts. Otherwise, only the first N prompts are kept with N being equal to the batch size.

-### Run Speculative Sampling on Gaudi
-
-If you want to generate a sequence of text from a prompt of your choice using assisted decoding, you can use the following command as an example:
-
-```bash
-PT_HPU_LAZY_MODE=1 python run_generation.py \
---model_name_or_path gpt2 \
---assistant_model distilgpt2 \
---batch_size 1 \
---max_new_tokens 100 \
---use_hpu_graphs \
---use_kv_cache \
---num_return_sequences 1 \
---temperature 0 \
---prompt "Alice and Bob" \
---sdp_on_bf16
-```
-
### Benchmark

The default behaviour of this script (i.e. if no dataset is specified with `--dataset_name`) is to benchmark the given model with a few pre-defined prompts or with the prompt you provided with `--prompt`.

@@ -146,21 +128,6 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_g
--sdp_on_bf16
```

-To run Llama3-405B inference on 8 Gaudi3 cards, use the following command:
-```bash
-PT_HPU_LAZY_MODE=1 ENABLE_LB_BUNDLE_ALL_COMPUTE_MME=0 ENABLE_EXPERIMENTAL_FLAGS=1 \
-python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
---model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
---max_new_tokens 2048 \
---bf16 \
---use_hpu_graphs \
---use_kv_cache \
---batch_size 1 \
---do_sample \
---use_flash_attention \
---flash_attention_causal_mask
-```
-
To run Deepseek-R1-BF16 inference on 16 Gaudi3 cards (2 nodes), use the following command.
Ensure you replace the hostfile parameter with the path to a hostfile describing your own nodes. A sample hostfile is available [here](/examples/multi-node-training/hostfile).

> NOTE: This support is currently experimental. Due to memory constraints, only a batch size of 1 (BS=1) is supported for now.
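For reference, a DeepSpeed-style hostfile lists one `<hostname> slots=<num_devices>` entry per node. Here is a minimal sketch for the 2-node, 16-card setup above; the hostnames are placeholders you would replace with your own machines:

```bash
# Write an illustrative 2-node hostfile; each Gaudi3 node exposes 8 cards,
# hence slots=8. Replace node-1/node-2 with your actual hostnames.
cat > hostfile <<'EOF'
node-1 slots=8
node-2 slots=8
EOF
```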