Conversation

@vovanphuc
Contributor

Summary

Add support for NVIDIA's Audio-Flamingo-3 (AF3), a state-of-the-art Large Audio-Language Model for speech, sound, and music understanding.

Changes

  • Add AudioFlamingo3Plugin class for audio feature extraction and processing
  • Add audio_flamingo_3 chat template with ChatML-style format
  • Register Audio-Flamingo-3 model group with multimodal=True
  • Register composite model for LoRA freeze support on audio encoder
  • Add Audio-Flamingo-3 to supported models table in README
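The ChatML-style format used by the new chat template can be sketched as follows. This is only an illustration of the message format, not the actual `audio_flamingo_3` template registered in this PR; the `<|im_start|>`/`<|im_end|>` tokens follow the standard ChatML convention.

```python
# Minimal sketch of a ChatML-style prompt renderer, as used by templates
# like audio_flamingo_3. Illustrative only; the registered template is
# implemented in the framework's own templating system.

def render_chatml(messages):
    """Render a list of {"role", "content"} dicts into ChatML text."""
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    # Leave the assistant turn open so the model generates the reply.
    prompt += "<|im_start|>assistant\n"
    return prompt

example = render_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<sound>\nWhat is happening in this audio?"},
])
print(example)
```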

Supported Models

| Model | HuggingFace ID |
| --- | --- |
| Audio-Flamingo-3 | nvidia/audio-flamingo-3-hf |

Key Features

  • Audio placeholder mapping: Automatically converts `<audio>` to `<sound>` for dataset compatibility
  • Feature extraction: Uses AF-Whisper encoder with proper input_features_mask handling
  • Token expansion: Delegated to processor for correct 30-second window calculation
  • LoRA support: Composite model registration enables freezing audio encoder during training
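The placeholder mapping above amounts to a string substitution before tokenization. The helper below is a hypothetical sketch, not the plugin's actual implementation:

```python
# Hypothetical sketch of the <audio> -> <sound> placeholder mapping
# described above; the real AudioFlamingo3Plugin may implement this
# differently.

AUDIO_PLACEHOLDER = "<audio>"   # placeholder used by existing datasets
AF3_PLACEHOLDER = "<sound>"     # placeholder expected by AF3's processor

def map_audio_placeholders(text: str) -> str:
    """Convert dataset-style audio placeholders to AF3's format."""
    return text.replace(AUDIO_PLACEHOLDER, AF3_PLACEHOLDER)

print(map_audio_placeholders("<audio>\nTranscribe this recording."))
```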

Model Specifications

  • Architecture: Qwen2.5-7B backbone + AF-Whisper audio encoder (8B params)
  • Audio context: Up to 10 minutes (processed in 30-second windows)
  • Capabilities: Transcription, audio QA, captioning, chain-of-thought reasoning

Test plan

  • Model loads with trust_remote_code=True
  • Template encoding verified
  • Audio feature extraction working
  • Pre-commit checks pass (make style && make quality)

@gemini-code-assist
Contributor

Summary of Changes

Hello @vovanphuc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates NVIDIA's Audio-Flamingo-3, a cutting-edge Large Audio-Language Model, into the system. This enhancement introduces robust capabilities for understanding and processing audio data, including speech, sound, and music. The changes encompass the necessary infrastructure for audio feature extraction, a dedicated chat templating system, and support for fine-tuning the audio encoder using LoRA.

Highlights

  • AudioFlamingo3Plugin Implementation: A new AudioFlamingo3Plugin class has been added to handle audio feature extraction and processing specifically for the Audio-Flamingo-3 model, including managing input_features_mask.
  • Chat Template Registration: A dedicated audio_flamingo_3 chat template has been registered, following a ChatML-style format, to ensure proper message formatting for the new model.
  • Model Group Registration: Audio-Flamingo-3 has been registered as a multimodal model group, linking it to its HuggingFace ID and the newly created chat template.
  • LoRA Support for Audio Encoder: A composite model for audioflamingo3 has been registered, enabling LoRA (Low-Rank Adaptation) support for fine-tuning its audio encoder.
  • README Updates: The README.md and README_zh.md files have been updated to include Audio-Flamingo-3 in the list of supported models.
  • Audio Placeholder Mapping: The AudioFlamingo3Plugin automatically converts <audio> placeholders to <sound> to maintain compatibility with existing datasets while adhering to AF3's expected placeholder format.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for NVIDIA's Audio-Flamingo-3 model. The changes are well-structured and include a new plugin for audio processing, a chat template, and necessary registrations for the model group and LoRA support. The implementation correctly handles audio feature extraction, placeholder mapping, and token expansion delegation as described. I have a few minor suggestions to improve documentation accuracy and code clarity.

README.md Outdated

| Model | Model size | Template |
| ----------------------------------------------------------------- | -------------------------------- | -------------------- |
| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |

medium

The link for Audio-Flamingo-3 points to the NVIDIA organization page. It would be more helpful to link directly to the model's Hugging Face page for easier access to model details.

Suggested change

```diff
-| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
+| [Audio-Flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3-hf) | 8B | audio_flamingo_3 |
```

README_zh.md Outdated

| 模型名 | 参数量 | Template |
| ----------------------------------------------------------------- | -------------------------------- | -------------------- |
| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |

medium

The link for Audio-Flamingo-3 points to the NVIDIA organization page. It would be more helpful to link directly to the model's Hugging Face page for easier access to model details.

Suggested change

```diff
-| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
+| [Audio-Flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3-hf) | 8B | audio_flamingo_3 |
```

Comment on lines 1487 to 1510
```python
if len(audios) != 0:
    audios = self._regularize_audios(
        audios,
        sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
    )["audios"]
    mm_inputs.update(
        feature_extractor(
            audios,
            sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
            return_attention_mask=True,
            padding="max_length",
            return_tensors="pt",
        )
    )
```

medium

This code block can be refactored for improved readability and to avoid a redundant getattr call: storing sampling_rate in a variable is more efficient, and if audios: is a more Pythonic way to check for a non-empty list. Additionally, renaming the regularized result to regularized_audios avoids shadowing the original variable and enhances clarity.

Suggested change

```diff
-if len(audios) != 0:
-    audios = self._regularize_audios(
-        audios,
-        sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
-    )["audios"]
-    mm_inputs.update(
-        feature_extractor(
-            audios,
-            sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
-            return_attention_mask=True,
-            padding="max_length",
-            return_tensors="pt",
-        )
-    )
+if audios:
+    sampling_rate = getattr(processor, "audio_sampling_rate", 16000)
+    regularized_audios = self._regularize_audios(
+        audios,
+        sampling_rate=sampling_rate,
+    )["audios"]
+    mm_inputs.update(
+        feature_extractor(
+            regularized_audios,
+            sampling_rate=sampling_rate,
+            return_attention_mask=True,
+            padding="max_length",
+            return_tensors="pt",
+        )
+    )
```

The original implementation had expand_mm_tokens=False, which caused
only one `<sound>` token to appear in the input while the AF3 encoder
produces N embeddings. This fix:
- Enables proper token expansion (expand_mm_tokens=True by default)
- Implements windowing for audio longer than 30 seconds (30-second chunks, up to 10 minutes)
- Calculates the correct token count using AF3's downsampling formula:
  conv_output_len = (mel_len - 1) // 2 + 1
  audio_tokens_len = (conv_output_len - 2) // 2 + 1
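The windowing and token-count arithmetic above can be sketched end to end. The downsampling formula is quoted from the commit message; the surrounding constants (16 kHz sampling, a mel hop length of 160 samples, i.e. 100 mel frames per second) are assumptions about the AF-Whisper front end, not values stated in this PR:

```python
import math

# Standalone sketch of AF3's audio-token calculation. The two-step
# downsampling formula comes from the commit message above; the mel
# frame rate and padding behavior are assumptions for illustration.

SAMPLING_RATE = 16_000
HOP_LENGTH = 160                  # mel frames = samples // hop (assumed)
WINDOW_SECONDS = 30
MAX_SECONDS = 600                 # 10-minute audio context

def audio_tokens_for_window(mel_len: int) -> int:
    """Apply AF3's two stride-2 downsampling steps to a mel length."""
    conv_output_len = (mel_len - 1) // 2 + 1
    return (conv_output_len - 2) // 2 + 1

def total_audio_tokens(duration_seconds: float) -> int:
    """Split audio into 30-second windows and sum the token counts,
    assuming each window is padded to a full 30 seconds of mel frames."""
    duration_seconds = min(duration_seconds, MAX_SECONDS)
    num_windows = math.ceil(duration_seconds / WINDOW_SECONDS)
    mel_len = WINDOW_SECONDS * SAMPLING_RATE // HOP_LENGTH  # 3000 frames
    return num_windows * audio_tokens_for_window(mel_len)

print(audio_tokens_for_window(3000))   # 750 tokens per full window
print(total_audio_tokens(75.0))        # 3 windows -> 2250 tokens
```

Under these assumptions, each full 30-second window contributes 750 `<sound>` tokens, which is why expanding only a single placeholder token breaks the embedding alignment.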
@vovanphuc vovanphuc force-pushed the feature/add-audio-flamingo-3-support branch from 6c8ad8c to 8a5ce59 on January 14, 2026 at 03:03