44 changes: 0 additions & 44 deletions examples/image-to-text/README.md
@@ -63,50 +63,6 @@ Inference with FP8 precision is enabled using [Intel Neural Compressor (INC)](ht
More information on enabling FP8 in SynapseAI is available here:
[Run Inference Using FP8](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Quantization/Inference_Using_FP8.html?highlight=fp8)

### Single card inference with FP8
Here is an example of measuring the tensor quantization statistics on Llava-v1.6-vicuna-13b with SDPA:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16
```

Here is an example of quantizing the model based on the previous measurements for Llava-v1.6-vicuna-13b with SDPA:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-vicuna-13b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--sdp_on_bf16
```
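
The two `QUANT_CONFIG` files referenced above follow Intel Neural Compressor's JSON schema. Below is a minimal sketch of what a measurement config contains; the field values are assumptions based on INC's FP8 documentation, so compare against the JSON files shipped under `./quantization_config/` before relying on them.

```bash
# Sketch of a custom INC measurement config (field values are assumptions
# based on INC's FP8 docs; see ./quantization_config/maxabs_measure.json
# for the exact file used above):
cat > ./my_maxabs_measure.json <<'EOF'
{
  "method": "HOOKS",
  "mode": "MEASURE",
  "observer": "maxabs",
  "dump_stats_path": "./hqt_output/measure"
}
EOF
```

The quantization config used in the second step is analogous, with `"mode": "QUANTIZE"` and a constant scale format.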

### Multi-card inference with FP8
Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention \
--flash_attention_recompute
```

Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b with FusedSDPA on 8 HPUs:
```bash
PT_HPU_LAZY_MODE=1 QUANT_CONFIG=./quantization_config/maxabs_quant_scale_format_const.json python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
--model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
--image_path "https://llava-vl.github.io/static/images/view.jpg" \
--use_hpu_graphs \
--bf16 \
--use_flash_attention \
--flash_attention_recompute
```

## LoRA Finetuning

Here are single-/multi-device command examples for meta-llama/Llama-3.2-11B-Vision-Instruct.
39 changes: 0 additions & 39 deletions examples/speech-recognition/README.md
@@ -276,45 +276,6 @@ PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
When training on a different language, be sure to change the `language` argument accordingly. The `language` and `task` arguments should be omitted for English speech recognition, as shown in the sketch below.
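
For instance, here is a minimal single-device sketch for English, assuming the `en` config of Common Voice 11; the remaining flags mirror the commands in this section:

```bash
# English fine-tuning sketch: --language and --task are simply omitted
# (assumes the "en" config of Common Voice 11; other flags as in this section)
PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
    --model_name_or_path="openai/whisper-small" \
    --dataset_name="mozilla-foundation/common_voice_11_0" \
    --trust_remote_code \
    --dataset_config_name="en" \
    --train_split_name="train+validation" \
    --eval_split_name="test" \
    --gaudi_config_name="Habana/whisper" \
    --output_dir="/tmp/whisper-small-en" \
    --text_column_name="sentence" \
    --bf16 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --predict_with_generate \
    --use_habana
```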


### Multi HPU Whisper Training with Seq2Seq
The following example shows how to fine-tune the [Whisper large](https://huggingface.co/openai/whisper-large) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using 8 HPU devices in half-precision:
```bash
PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py \
--world_size 8 --use_mpi run_speech_recognition_seq2seq.py \
--model_name_or_path="openai/whisper-large" \
--dataset_name="mozilla-foundation/common_voice_11_0" \
--trust_remote_code \
--dataset_config_name="hi" \
--language="hindi" \
--task="transcribe" \
--train_split_name="train+validation" \
--eval_split_name="test" \
--gaudi_config_name="Habana/whisper" \
--max_steps="625" \
--output_dir="/tmp/whisper-large-hi" \
--per_device_train_batch_size="16" \
--per_device_eval_batch_size="2" \
--logging_steps="25" \
--learning_rate="1e-5" \
--generation_max_length="225" \
--preprocessing_num_workers="1" \
--max_duration_in_seconds="30" \
--text_column_name="sentence" \
--freeze_feature_encoder="False" \
--sdp_on_bf16 \
--bf16 \
--overwrite_output_dir \
--do_train \
--do_eval \
--predict_with_generate \
--use_habana \
--use_hpu_graphs_for_inference \
--label_features_max_length 128 \
--dataloader_num_workers 8 \
--gradient_checkpointing \
--throughput_warmup_steps 3
```
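
Once training completes, the saved checkpoint can be evaluated on the test split without retraining. Here is a sketch reusing the flags from the command above; the checkpoint path assumes the `--output_dir` used there:

```bash
# Eval-only sketch for the fine-tuned checkpoint (flags mirror the training
# command above; /tmp/whisper-large-hi assumes the --output_dir used there)
PT_HPU_LAZY_MODE=1 python run_speech_recognition_seq2seq.py \
    --model_name_or_path="/tmp/whisper-large-hi" \
    --dataset_name="mozilla-foundation/common_voice_11_0" \
    --trust_remote_code \
    --dataset_config_name="hi" \
    --language="hindi" \
    --task="transcribe" \
    --eval_split_name="test" \
    --gaudi_config_name="Habana/whisper" \
    --output_dir="/tmp/whisper-large-hi-eval" \
    --per_device_eval_batch_size="2" \
    --generation_max_length="225" \
    --text_column_name="sentence" \
    --bf16 \
    --do_eval \
    --predict_with_generate \
    --use_habana \
    --use_hpu_graphs_for_inference
```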

#### Single HPU Seq2Seq Inference

The following example shows how to do inference with the [Whisper small](https://huggingface.co/openai/whisper-small) checkpoint on the Hindi subset of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) using a single HPU device in half-precision:
33 changes: 0 additions & 33 deletions examples/text-generation/README.md
@@ -98,24 +98,6 @@ PT_HPU_LAZY_MODE=1 python run_generation.py \

> The batch size should be greater than or equal to the number of prompts. Otherwise, only the first N prompts are kept, where N is the batch size.
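
For example, passing two prompts requires a batch size of at least 2. Here is a sketch assuming `--prompt` accepts several quoted strings, as the note above implies:

```bash
# Two prompts, so --batch_size must be >= 2 (a sketch; flags as in the
# examples elsewhere in this README)
PT_HPU_LAZY_MODE=1 python run_generation.py \
--model_name_or_path gpt2 \
--batch_size 2 \
--max_new_tokens 100 \
--use_hpu_graphs \
--use_kv_cache \
--sdp_on_bf16 \
--prompt "Alice and Bob" "Here is my second prompt"
```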

### Run Speculative Sampling on Gaudi

If you want to generate a sequence of text from a prompt of your choice using assisted decoding, you can use the following command as an example:

```bash
PT_HPU_LAZY_MODE=1 python run_generation.py \
--model_name_or_path gpt2 \
--assistant_model distilgpt2 \
--batch_size 1 \
--max_new_tokens 100 \
--use_hpu_graphs \
--use_kv_cache \
--num_return_sequences 1 \
--temperature 0 \
--prompt "Alice and Bob" \
--sdp_on_bf16
```

### Benchmark

The default behaviour of this script (i.e. if no dataset is specified with `--dataset_name`) is to benchmark the given model with a few pre-defined prompts or with the prompt you provide via `--prompt`.
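
For instance, a plain benchmark run with the default prompts could look like the following sketch (flags taken from the examples in this README):

```bash
# Benchmark with the pre-defined prompts: no --dataset_name, no --prompt
PT_HPU_LAZY_MODE=1 python run_generation.py \
--model_name_or_path gpt2 \
--batch_size 1 \
--max_new_tokens 100 \
--use_hpu_graphs \
--use_kv_cache \
--bf16
```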
@@ -146,21 +128,6 @@ PT_HPU_LAZY_MODE=1 python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_g
--sdp_on_bf16
```

To run Llama-3.1-405B-Instruct inference on 8 Gaudi3 cards, use the following command:
```bash
PT_HPU_LAZY_MODE=1 ENABLE_LB_BUNDLE_ALL_COMPUTE_MME=0 ENABLE_EXPERIMENTAL_FLAGS=1 \
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-3.1-405B-Instruct \
--max_new_tokens 2048 \
--bf16 \
--use_hpu_graphs \
--use_kv_cache \
--batch_size 1 \
--do_sample \
--use_flash_attention \
--flash_attention_causal_mask
```

To run DeepSeek-R1-BF16 inference on 16 Gaudi3 cards (2 nodes), use the following command. Make sure to replace the hostfile parameter with the path to your own file; a sample hostfile is available [here](/examples/multi-node-training/hostfile).
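
A DeepSpeed hostfile lists one node per line with its device count. Here is a minimal sketch for two 8-card nodes (hostnames are placeholders; see the sample linked above for the real format):

```bash
# Sketch of a 2-node hostfile in DeepSpeed's standard format
cat > ./hostfile <<'EOF'
node-1 slots=8
node-2 slots=8
EOF
```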

> NOTE: This support is currently experimental. Due to memory constraints, only BS=1 is supported for now.