44 changes: 0 additions & 44 deletions examples/image-to-text/README.md
@@ -63,50 +63,6 @@ Inference with FP8 precision is enabled using [Intel Neural Compressor (INC)](ht
More information on enabling FP8 in SynapseAI is available here:
[Run Inference Using FP8](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html?highlight=fp8)

### Single card inference with FP8
Here is an example of measuring the tensor quantization statistics on Llava-v1.6-vicuna-13b with SDPA:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16
```

Here is an example of quantizing the model based on the previous measurements for Llava-v1.6-vicuna-13b with SDPA:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16
```
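
The two `QUANT_CONFIG` files referenced above follow Intel Neural Compressor's JSON schema. Below is a minimal sketch of what a measurement config contains; the field values are assumptions based on INC's FP8 documentation, so compare against the JSON files shipped under `./quantization_config/` before relying on them.

```bash
# Sketch of a custom INC measurement config (field values are assumptions
# based on INC's FP8 docs; see ./quantization_config/maxabs_measure.json
# for the exact file used above):
cat > ./my_maxabs_measure.json <<'EOF'
{
  "method": "HOOKS",
  "mode": "MEASURE",
  "observer": "maxabs",
  "dump_stats_path": "./hqt_output/measure"
}
EOF
```

The quantization config used in the second step is analogous, with `"mode": "QUANTIZE"` and a constant scale format.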

### Multi-card inference with FP8
Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention \
--flash_attention_recompute
```

Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention \
--flash_attention_recompute
```

## LoRA Finetuning

Here are single-/multi-device command examples for meta-llama/Llama-3.2-11B-Vision-Instruct.
39 changes: 0 additions & 39 deletions examples/speech-recognition/README.md
@@ -276,45 +276,6 @@ PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
When training on a different language, be sure to change the `language` argument accordingly. The `language` and `task` arguments should be omitted for English speech recognition, as shown in the sketch below.
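
For instance, here is a minimal single-device sketch for English, assuming the `en` config of Common Voice 11; the remaining flags mirror the commands in this section:

```bash
# English fine-tuning sketch: --language and --task are simply omitted
# (assumes the "en" config of Common Voice 11; other flags as in this section)
PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
    --model_name_or_path="openai/whisper-small" \
    --dataset_name="mozilla-foundation/common_voice_11_0" \
    --trust_remote_code \
    --dataset_config_name="en" \
    --train_split_name="train+validation" \
    --eval_split_name="test" \
    --gaudi_config_name="Habana/whisper" \
    --output_dir="/tmp/whisper-small-en" \
    --text_column_name="sentence" \
    --bf16 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --use_habana
```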


### Multi HPU Whisper Training with Seq2Seq
The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision:
```bash
PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
--world_size 8 --use_mpi run_speech_recognition_seq2seq.py \
--model_name_or_path="openai/whisper-large" \
--dataset_name="mozilla-foundation/common_voice_11_0" \
--trust_remote_code \
--dataset_config_name="hi" \
--language="hindi" \
--task="transcribe" \
--train_split_name="train+validation" \
--eval_split_name="test" \
--gaudi_config_name="Habana/whisper" \
--max_steps="625" \
--output_dir="/tmp/whisper-large-hi" \
--per_device_train_batch_size="16" \
--per_device_eval_batch_size="2" \
--logging_steps="25" \
--learning_rate="1e-5" \
--generation_max_length="225" \
--preprocessing_num_workers="1" \
--max_duration_in_seconds="30" \
--text_column_name="sentence" \
--freeze_feature_encoder="False" \
--sdp_on_bf16 \
--bf16 \
--overwrite_output_dir \
--do_train \
--do_eval \
--predict_with_generate \
--use_habana \
--use_hpu_graphs_for_inference \
--label_features_max_length 128 \
--dataloader_num_workers 8 \
--gradient_checkpointing \
--throughput_warmup_steps 3
```
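
Once training completes, the saved checkpoint can be evaluated on the test split without retraining. Here is a sketch reusing the flags from the command above; the checkpoint path assumes the `--output_dir` used there:

```bash
# Eval-only sketch for the fine-tuned checkpoint (flags mirror the training
# command above; /tmp/whisper-large-hi assumes the --output_dir used there)
PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
    --model_name_or_path="/tmp/whisper-large-hi" \
    --dataset_name="mozilla-foundation/common_voice_11_0" \
    --trust_remote_code \
    --dataset_config_name="hi" \
    --language="hindi" \
    --task="transcribe" \
    --eval_split_name="test" \
    --gaudi_config_name="Habana/whisper" \
    --output_dir="/tmp/whisper-large-hi-eval" \
    --per_device_eval_batch_size="2" \
    --generation_max_length="225" \
    --text_column_name="sentence" \
    --bf16 \
    --do_eval \
    --predict_with_generate \
    --use_habana \
    --use_hpu_graphs_for_inference
```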

#### Single HPU Seq2Seq Inference

The following example shows how to do inference with the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using a single HPU device in half-precision:
33 changes: 0 additions & 33 deletions examples/text-generation/README.md
@@ -98,24 +98,6 @@ PT_HPU_LAZY_MODE=1 python run_generation.py \

> The batch size should be greater than or equal to the number of prompts. Otherwise, only the first N prompts are kept, where N is the batch size.
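
For example, passing two prompts requires a batch size of at least 2. Here is a sketch assuming `--prompt` accepts several quoted strings, as the note above implies:

```bash
# Two prompts, so --batch_size must be >= 2 (a sketch; flags as in the
# examples elsewhere in this README)
PT_HPU_LAZY_MODE=1 python run_generation.py \
--model_name_or_path gpt2 \
--batch_size 2 \
--max_new_tokens 100 \
--use_hpu_graphs \
--use_kv_cache \
--sdp_on_bf16 \
--prompt "Alice and Bob" "Here is my second prompt"
```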

### Run Speculative Sampling on Gaudi

If you want to generate a sequence of text from a prompt of your choice using assisted decoding, you can use the following command as an example:

```bash
PT_HPU_LAZY_MODE=1 python run_generation.py \
--model_name_or_path gpt2 \
--assistant_model distilgpt2 \
--batch_size 1 \
--max_new_tokens 100 \
--use_hpu_graphs \
--use_kv_cache \
--num_return_sequences 1 \
--temperature 0 \
--prompt "Alice and Bob" \
--sdp_on_bf16
```

### Benchmark

The default behaviour of this script (i.e. if no dataset is specified with `--dataset_name`) is to benchmark the given model with a few pre-defined prompts or with the prompt you provide via `--prompt`.
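
For instance, a plain benchmark run with the default prompts could look like the following sketch (flags taken from the examples in this README):

```bash
# Benchmark with the pre-defined prompts: no --dataset_name, no --prompt
PT_HPU_LAZY_MODE=1 python run_generation.py \
--model_name_or_path gpt2 \
--batch_size 1 \
--max_new_tokens 100 \
--use_hpu_graphs \
--use_kv_cache \
--bf16
```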
@@ -146,21 +128,6 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_g
--sdp_on_bf16
```

To run Llama-3.1-405B-Instruct inference on 8 Gaudi3 cards, use the following command:
```bash
PT_HPU_LAZY_MODE=1 ENABLE_LB_BUNDLE_ALL_COMPUTE_MME=0 ENABLE_EXPERIMENTAL_FLAGS=1 \
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
--max_new_tokens 2048 \
--bf16 \
--use_hpu_graphs \
--use_kv_cache \
--batch_size 1 \
--do_sample \
--use_flash_attention \
--flash_attention_causal_mask
```

To run DeepSeek-R1-BF16 inference on 16 Gaudi3 cards (2 nodes), use the following command. Make sure to replace the hostfile parameter with the path to your own file; a sample hostfile is available [here](/examples/multi-node-training/hostfile).
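
A DeepSpeed hostfile lists one node per line with its device count. Here is a minimal sketch for two 8-card nodes (hostnames are placeholders; see the sample linked above for the real format):

```bash
# Sketch of a 2-node hostfile in DeepSpeed's standard format
cat > ./hostfile <<'EOF'
node-1 slots=8
node-2 slots=8
EOF
```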

> NOTE: This support is currently experimental. Due to memory constraints, only BS=1 is supported for now.