
Conversation


@slokesha slokesha commented Jul 22, 2025

What does this PR do?

This PR fixes the performance drop for the Phi model.
It adds a graph break via mark_step in lazy mode and an attn_softmax_bf16 flag, which also improves performance in some cases.
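The graph break described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern (not the PR's actual diff): on Gaudi, `mark_step` comes from `habana_frameworks.torch.core`; here a no-op fallback is provided so the sketch runs anywhere, and `generate_tokens`/`step_fn` are made-up names standing in for the real decode loop.

```python
import os

# On Gaudi, mark_step flushes the accumulated lazy graph for compilation
# and execution. Fall back to a no-op off-HPU so this sketch stays runnable.
try:
    from habana_frameworks.torch.core import mark_step  # type: ignore
except ImportError:
    def mark_step() -> None:
        pass

def generate_tokens(step_fn, num_steps: int, lazy_mode: bool) -> list:
    """Run num_steps decode steps, inserting a graph break after each one."""
    tokens = []
    for i in range(num_steps):
        tokens.append(step_fn(i))
        if lazy_mode:
            # Break the lazy graph here instead of letting the whole decode
            # loop accumulate into one oversized graph.
            mark_step()
    return tokens

lazy = os.getenv("PT_HPU_LAZY_MODE", "0") == "1"
print(generate_tokens(lambda i: i * 2, 4, lazy))  # → [0, 2, 4, 6]
```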

HL-SMI Version: hl-1.22.0-rc-fw-
Driver Version: 1.22.0-48ef525
Nic Driver Version: 1.22.0-48ef525

image: artifactory-kfs.habana-labs.com/docker-local/1.22.0/ubuntu22.04/habanalabs/pytorch-installer-2.7.1:1.22.0-543

[image: benchmark results]

@12010486 12010486 mentioned this pull request Jul 23, 2025
@slokesha slokesha marked this pull request as ready for review July 24, 2025 08:30
@karol-brejna-i
Collaborator

@slokesha The PR looks good. Can you share some info on how to validate it (e.g., which test to run)?

@12010486
Collaborator

12010486 commented Aug 8, 2025

Chiming in.
The tests that were run, and that I believe you can use to verify, are:

cd optimum-habana/
pip install -e .
python -m pip install .[tests]
pip install pytest
PT_HPU_LAZY_MODE=1 python -m pytest tests/test_text_generation_example.py --device gaudi3 -v -s --token=xxx --junitxml=/tmp/test_run_microsoft_phi.xml --log-cli-level 20 -k test_text_generation_bf16_1x[microsoft/phi-2-1-False-False]

and

PT_HPU_LAZY_MODE=1 python -m pytest tests/test_text_generation_example.py --device gaudi2 -v -s --token=xxx --junitxml=/tmp/test_run_microsoft_phi.xml --log-cli-level 20 -k test_text_generation_fp8[microsoft/phi-2-1-1-True-128-128] 

generation_config.trust_remote_code = args.trust_remote_code
generation_config.valid_sequence_lengths = None
generation_config.attn_batch_split = args.attn_batch_split
generation_config.fp8 = bool(args.quant_config)
A collaborator commented on the diff above:

Not sure we should add this to the generation config, as there could be other ways of running the model in fp8. Maybe we could simply check the value of the QUANT_CONFIG env variable in the modeling file, since this is how args.quant_config is set up:

args.quant_config = os.getenv("QUANT_CONFIG", "")

WDYT?
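The reviewer's suggestion can be sketched as below. This is a hypothetical helper, not code from the PR: `is_fp8_run` is a made-up name, and the example config path is only illustrative; the one real anchor is that the run script sets args.quant_config from the QUANT_CONFIG environment variable, so a non-empty value signals an fp8 quantized run.

```python
import os

def is_fp8_run() -> bool:
    """Detect an fp8 run from the env var that populates args.quant_config.

    The run script does: args.quant_config = os.getenv("QUANT_CONFIG", ""),
    so a non-empty QUANT_CONFIG implies a quantized (fp8) run without
    threading an extra flag through generation_config.
    """
    return bool(os.getenv("QUANT_CONFIG", ""))

os.environ.pop("QUANT_CONFIG", None)
print(is_fp8_run())  # → False

# Illustrative path only; any non-empty value marks the run as quantized.
os.environ["QUANT_CONFIG"] = "quantization_config/maxabs_quant.json"
print(is_fp8_run())  # → True
```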

@astachowiczhabana astachowiczhabana removed their assignment Sep 11, 2025
@karol-brejna-i
Collaborator

@slokesha Are you still working on this PR?
