Conversation

@vovanphuc
Contributor

Summary

Add support for NVIDIA's Audio-Flamingo-3 (AF3), a state-of-the-art Large Audio-Language Model for speech, sound, and music understanding.

Changes

  • Add AudioFlamingo3Plugin class for audio feature extraction and processing
  • Add audio_flamingo_3 chat template with ChatML-style format
  • Register Audio-Flamingo-3 model group with multimodal=True
  • Register composite model for LoRA freeze support on audio encoder
  • Add Audio-Flamingo-3 to supported models table in README
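The ChatML-style format used by the new chat template can be sketched as follows. This is only an illustration of the message format, not the actual `audio_flamingo_3` template registered in this PR; the `<|im_start|>`/`<|im_end|>` tokens follow the standard ChatML convention.

```python
# Minimal sketch of a ChatML-style prompt renderer, as used by templates
# like audio_flamingo_3. Illustrative only; the registered template is
# implemented in the framework's own templating system.

def render_chatml(messages):
    """Render a list of {"role", "content"} dicts into ChatML text."""
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    # Leave the assistant turn open so the model generates the reply.
    prompt += "<|im_start|>assistant\n"
    return prompt

example = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<sound>\nWhat is happening in this audio?"},
])
print(example)
```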

Supported Models

| Model | HuggingFace ID |
| --- | --- |
| Audio-Flamingo-3 | nvidia/audio-flamingo-3-hf |

Key Features

  • Audio placeholder mapping: Automatically converts `<audio>` to `<sound>` for dataset compatibility
  • Feature extraction: Uses AF-Whisper encoder with proper input_features_mask handling
  • Token expansion: Delegated to processor for correct 30-second window calculation
  • LoRA support: Composite model registration enables freezing audio encoder during training
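The placeholder mapping above amounts to a string substitution before tokenization. The helper below is a hypothetical sketch, not the plugin's actual implementation:

```python
# Hypothetical sketch of the <audio> -> <sound> placeholder mapping
# described above; the real AudioFlamingo3Plugin may implement this
# differently.

AUDIO_PLACEHOLDER = "<audio>"   # placeholder used by existing datasets
AF3_PLACEHOLDER = "<sound>"     # placeholder expected by AF3's processor

def map_audio_placeholders(text: str) -> str:
    """Convert dataset-style audio placeholders to AF3's format."""
    return text.replace(AUDIO_PLACEHOLDER, AF3_PLACEHOLDER)

print(map_audio_placeholders("<audio>\nTranscribe this recording."))
```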

Model Specifications

  • Architecture: Qwen2.5-7B backbone + AF-Whisper audio encoder (8B params)
  • Audio context: Up to 10 minutes (processed in 30-second windows)
  • Capabilities: Transcription, audio QA, captioning, chain-of-thought reasoning

Test plan

  • Model loads with trust_remote_code=True
  • Template encoding verified
  • Audio feature extraction working
  • Pre-commit checks pass (make style && make quality)

@gemini-code-assist
Contributor

Summary of Changes

Hello @vovanphuc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates NVIDIA's Audio-Flamingo-3, a cutting-edge Large Audio-Language Model, into the system. This enhancement introduces robust capabilities for understanding and processing audio data, including speech, sound, and music. The changes encompass the necessary infrastructure for audio feature extraction, a dedicated chat templating system, and support for fine-tuning the audio encoder using LoRA.

Highlights

  • AudioFlamingo3Plugin Implementation: A new AudioFlamingo3Plugin class has been added to handle audio feature extraction and processing specifically for the Audio-Flamingo-3 model, including managing input_features_mask.
  • Chat Template Registration: A dedicated audio_flamingo_3 chat template has been registered, following a ChatML-style format, to ensure proper message formatting for the new model.
  • Model Group Registration: Audio-Flamingo-3 has been registered as a multimodal model group, linking it to its HuggingFace ID and the newly created chat template.
  • LoRA Support for Audio Encoder: A composite model for audioflamingo3 has been registered, enabling LoRA (Low-Rank Adaptation) support for fine-tuning its audio encoder.
  • README Updates: The README.md and README_zh.md files have been updated to include Audio-Flamingo-3 in the list of supported models.
  • Audio Placeholder Mapping: The AudioFlamingo3Plugin automatically converts <audio> placeholders to <sound> to maintain compatibility with existing datasets while adhering to AF3's expected placeholder format.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for NVIDIA's Audio-Flamingo-3 model. The changes are well-structured and include a new plugin for audio processing, a chat template, and necessary registrations for the model group and LoRA support. The implementation correctly handles audio feature extraction, placeholder mapping, and token expansion delegation as described. I have a few minor suggestions to improve documentation accuracy and code clarity.

README.md Outdated

| Model | Model size | Template |
| ----------------------------------------------------------------- | -------------------------------- | -------------------- |
| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |

medium

The link for Audio-Flamingo-3 points to the NVIDIA organization page. It would be more helpful to link directly to the model's Hugging Face page for easier access to model details.

Suggested change

```diff
-| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
+| [Audio-Flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3-hf) | 8B | audio_flamingo_3 |
```

README_zh.md Outdated

| 模型名 | 参数量 | Template |
| ----------------------------------------------------------------- | -------------------------------- | -------------------- |
| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |

medium

The link for Audio-Flamingo-3 points to the NVIDIA organization page. It would be more helpful to link directly to the model's Hugging Face page for easier access to model details.

Suggested change

```diff
-| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
+| [Audio-Flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3-hf) | 8B | audio_flamingo_3 |
```

Comment on lines 1487 to 1510
```python
if len(audios) != 0:
    audios = self._regularize_audios(
        audios,
        sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
    )["audios"]
    mm_inputs.update(
        feature_extractor(
            audios,
            sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
            return_attention_mask=True,
            padding="max_length",
            return_tensors="pt",
        )
    )
```

medium

This code block can be refactored for improved readability and to avoid a redundant getattr call: storing sampling_rate in a variable is more efficient, and if audios: is a more Pythonic way to check for a non-empty list. Additionally, renaming the regularized result to regularized_audios avoids shadowing the original variable and enhances clarity.

Suggested change

```diff
-if len(audios) != 0:
-    audios = self._regularize_audios(
-        audios,
-        sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
-    )["audios"]
-    mm_inputs.update(
-        feature_extractor(
-            audios,
-            sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
-            return_attention_mask=True,
-            padding="max_length",
-            return_tensors="pt",
-        )
-    )
+if audios:
+    sampling_rate = getattr(processor, "audio_sampling_rate", 16000)
+    regularized_audios = self._regularize_audios(
+        audios,
+        sampling_rate=sampling_rate,
+    )["audios"]
+    mm_inputs.update(
+        feature_extractor(
+            regularized_audios,
+            sampling_rate=sampling_rate,
+            return_attention_mask=True,
+            padding="max_length",
+            return_tensors="pt",
+        )
+    )
```

The original implementation had expand_mm_tokens=False, which caused
only one `<sound>` token to appear in the input while the AF3 encoder
produces N embeddings. This fix:
- Enables proper token expansion (expand_mm_tokens=True by default)
- Implements windowing for audio longer than 30 seconds (30-second chunks, up to 10 minutes)
- Calculates the correct token count using AF3's downsampling formula:
  conv_output_len = (mel_len - 1) // 2 + 1
  audio_tokens_len = (conv_output_len - 2) // 2 + 1
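The windowing and token-count arithmetic above can be sketched end to end. The downsampling formula is quoted from the commit message; the surrounding constants (16 kHz sampling, a mel hop length of 160 samples, i.e. 100 mel frames per second) are assumptions about the AF-Whisper front end, not values stated in this PR:

```python
import math

# Standalone sketch of AF3's audio-token calculation. The two-step
# downsampling formula comes from the commit message above; the mel
# frame rate and padding behavior are assumptions for illustration.

SAMPLING_RATE = 16_000
HOP_LENGTH = 160                  # mel frames = samples // hop (assumed)
WINDOW_SECONDS = 30
MAX_SECONDS = 600                 # 10-minute audio context

def audio_tokens_for_window(mel_len: int) -> int:
    """Apply AF3's two stride-2 downsampling steps to a mel length."""
    conv_output_len = (mel_len - 1) // 2 + 1
    return (conv_output_len - 2) // 2 + 1

def total_audio_tokens(duration_seconds: float) -> int:
    """Split audio into 30-second windows and sum the token counts,
    assuming each window is padded to a full 30 seconds of mel frames."""
    duration_seconds = min(duration_seconds, MAX_SECONDS)
    num_windows = math.ceil(duration_seconds / WINDOW_SECONDS)
    mel_len = WINDOW_SECONDS * SAMPLING_RATE // HOP_LENGTH  # 3000 frames
    return num_windows * audio_tokens_for_window(mel_len)

print(audio_tokens_for_window(3000))   # 750 tokens per full window
print(total_audio_tokens(75.0))        # 3 windows -> 2250 tokens
```

Under these assumptions, each full 30-second window contributes 750 `<sound>` tokens, which is why expanding only a single placeholder token breaks the embedding alignment.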
@vovanphuc vovanphuc force-pushed the feature/add-audio-flamingo-3-support branch from 6c8ad8c to 8a5ce59 on January 14, 2026 at 03:03