fix: fp8 rollout nightly fix check from step 100 to 40 #1233

terrykong · 2025-09-30T01:16:11Z

A fix missed in my earlier commit: #1231

tested locally and passes

Summary by CodeRabbit

Tests
- Adjusted validation to check token probability error thresholds at step 40 (previously 100) for LLM rollouts. Maintains existing gating on maximum step completion while improving CI reliability and reducing test flakiness. No impact on model behavior or user features.
Chores
- Operational test tuning only; no user-facing functionality, performance, or UI changes.

Signed-off-by: Terry Kong <[email protected]>

coderabbitai · 2025-09-30T01:20:32Z

📝 Walkthrough

Walkthrough

Adjusts a test assertion in the GRPO Llama3.1 8B Megatron FP8 rollouts script to compare the token_mult_prob_error threshold at index 40 instead of 100, with no other logic or command sequence changes.

Changes

Cohort / File(s)	Change Summary
Test validation index adjustment `tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh`	Updated the checked index for train token_mult_prob_error threshold from 100 to 40; conditional checks and flow unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

fix: grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts runs 40 steps #1231 — Adjusts the same test to run 40 steps (reducing STEPS_PER_RUN/MAX_STEPS), directly aligning with this PR’s shift of the assertion index to 40.

Suggested labels

CI:L0

Suggested reviewers

guyueh1
chtruong814

Pre-merge checks and finishing touches

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title concisely describes the primary change of updating the nightly FP8 rollout verification check from step 100 to step 40, matching the modifications in the test script without extraneous detail or noise.
Docstring Coverage	✅ Passed	No functions found in the changes. Docstring coverage check skipped.
Test Results For Major Changes	✅ Passed	The change only adjusts the test suite’s assertion index from step 100 to step 40 to align with the configured MAX_STEPS value and does not alter any production code, numerical logic, or performance-critical paths; this is a minor test script correction that has been verified locally and therefore meets the criteria for a pass without requiring additional documented test results.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch tk/fp8-another-fix

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c2b36f2 and f36f212.

📒 Files selected for processing (1)

tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh (1 hunks)

🧰 Additional context used

📓 Path-based instructions (3)

**/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.sh: Follow the Google Shell Style Guide for all shell scripts
Use uv run to execute Python scripts in shell/driver scripts instead of activating virtualenvs and calling python directly
Add the NVIDIA copyright header (with current year) at the top of all shell scripts, excluding tests/ and test-only scripts

Files:

tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh

tests/test_suites/llm/*.sh

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM driver script filenames must mirror the YAML base name and follow the same pattern with .sh extension

Files:

tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh

tests/test_suites/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place driver shell scripts and common.env under tests/test_suites// and list nightly tests in tests/test_suites/nightly.txt

Files:

tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Docs_Tests
GitHub Check: Lint check
GitHub Check: Post automodel integration comment / Comment on PR
GitHub Check: Post submodule check comment / Comment on PR

🔇 Additional comments (1)

tests/test_suites/llm/grpo-llama3.1-8b-instruct-1n8g-megatron-fp8-rollouts.v2.sh (1)

38-38: LGTM! Fix aligns assertion with MAX_STEPS.

The change correctly updates the index from 100 to 40, matching the MAX_STEPS=40 configuration. The previous value would have referenced a non-existent step since training only runs to step 40.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

Public repositories are always opted into early access features.
You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

fix: fp8 rollout nightly fix check from step 100 to 40

f36f212

Signed-off-by: Terry Kong <[email protected]>

terrykong requested review from guyueh1 and chtruong814 September 30, 2025 01:16

terrykong requested a review from a team as a code owner September 30, 2025 01:16

terrykong added the CI:docs Run doctest label Sep 30, 2025

terrykong enabled auto-merge (squash) September 30, 2025 01:16

terrykong temporarily deployed to nemo-ci September 30, 2025 01:16 — with GitHub Actions Inactive

terrykong temporarily deployed to nemo-ci September 30, 2025 01:17 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: fp8 rollout nightly fix check from step 100 to 40 #1233

fix: fp8 rollout nightly fix check from step 100 to 40 #1233

Uh oh!

terrykong commented Sep 30, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Sep 30, 2025

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

Uh oh!

fix: fp8 rollout nightly fix check from step 100 to 40 #1233

Are you sure you want to change the base?

fix: fp8 rollout nightly fix check from step 100 to 40 #1233

Uh oh!

Conversation

terrykong commented Sep 30, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 30, 2025

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

Uh oh!

terrykong commented Sep 30, 2025 •

edited by coderabbitai bot

Loading