[model] support NVIDIA's Audio-Flamingo-3 audio model #9740
Summary of Changes
Hello @vovanphuc, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates NVIDIA's Audio-Flamingo-3, a Large Audio-Language Model, into the system. It adds support for understanding and processing audio data, including speech, sound, and music. The changes cover the infrastructure for audio feature extraction, a dedicated chat template, and support for fine-tuning the audio encoder with LoRA.
Code Review
This pull request adds support for NVIDIA's Audio-Flamingo-3 model. The changes are well-structured and include a new plugin for audio processing, a chat template, and necessary registrations for the model group and LoRA support. The implementation correctly handles audio feature extraction, placeholder mapping, and token expansion delegation as described. I have a few minor suggestions to improve documentation accuracy and code clarity.
README.md
| Model | Model size | Template |
| ----------------------------------------------------------------- | -------------------------------- | -------------------- |
| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
The link for Audio-Flamingo-3 points to the NVIDIA organization page. It would be more helpful to link directly to the model's Hugging Face page for easier access to model details.
Suggested change:
- | [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
+ | [Audio-Flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3-hf) | 8B | audio_flamingo_3 |
README_zh.md
| 模型名 | 参数量 | Template |
| ----------------------------------------------------------------- | -------------------------------- | -------------------- |
| [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
The link for Audio-Flamingo-3 points to the NVIDIA organization page. It would be more helpful to link directly to the model's Hugging Face page for easier access to model details.
Suggested change:
- | [Audio-Flamingo-3](https://huggingface.co/nvidia) | 8B | audio_flamingo_3 |
+ | [Audio-Flamingo-3](https://huggingface.co/nvidia/audio-flamingo-3-hf) | 8B | audio_flamingo_3 |
if len(audios) != 0:
    audios = self._regularize_audios(
        audios,
        sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
    )["audios"]
    mm_inputs.update(
        feature_extractor(
            audios,
            sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
            return_attention_mask=True,
            padding="max_length",
            return_tensors="pt",
        )
    )
This code block can be refactored for improved readability and to avoid a redundant getattr call. Storing sampling_rate in a variable is more efficient. Using if audios: is a more Pythonic way to check for a non-empty list. Additionally, renaming the audios variable after regularization to regularized_audios prevents variable shadowing and enhances clarity.
Suggested change:
- if len(audios) != 0:
-     audios = self._regularize_audios(
-         audios,
-         sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
-     )["audios"]
-     mm_inputs.update(
-         feature_extractor(
-             audios,
-             sampling_rate=getattr(processor, "audio_sampling_rate", 16000),
-             return_attention_mask=True,
-             padding="max_length",
-             return_tensors="pt",
-         )
-     )
+ if audios:
+     sampling_rate = getattr(processor, "audio_sampling_rate", 16000)
+     regularized_audios = self._regularize_audios(
+         audios,
+         sampling_rate=sampling_rate,
+     )["audios"]
+     mm_inputs.update(
+         feature_extractor(
+             regularized_audios,
+             sampling_rate=sampling_rate,
+             return_attention_mask=True,
+             padding="max_length",
+             return_tensors="pt",
+         )
+     )
The original implementation had expand_mm_tokens=False, which caused only 1 <sound> token in the input while the AF3 encoder produces N embeddings. This fix:
- Enables proper token expansion (expand_mm_tokens=True by default)
- Implements windowing for audio >30s (30-second chunks, max 10 min)
- Calculates the correct token count using AF3's downsampling formula:
  conv_output_len = (mel_len - 1) // 2 + 1
  audio_tokens_len = (conv_output_len - 2) // 2 + 1
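The token-count arithmetic in this commit message can be sketched in a few lines. This is a minimal illustration, not the plugin's actual code: the two floor-division formulas, the 30-second window, the 10-minute cap, and the 16000 Hz default come from this PR, while the helper names, the hop length of 160 samples (a Whisper-style mel front end), truncation at the cap, and deriving mel_len from each window's unpadded sample count are assumptions made for the example.

```python
import math

SAMPLING_RATE = 16000   # default used in the reviewed code (audio_sampling_rate)
WINDOW_SECONDS = 30     # chunk length stated in the commit message
MAX_SECONDS = 600       # 10-minute cap stated in the commit message
HOP_LENGTH = 160        # ASSUMPTION: Whisper-style hop, 3000 mel frames per 30 s


def audio_tokens_for_mel_len(mel_len: int) -> int:
    """Apply the two downsampling steps quoted in the commit message."""
    conv_output_len = (mel_len - 1) // 2 + 1
    return (conv_output_len - 2) // 2 + 1


def split_into_windows(num_samples: int) -> list[int]:
    """Return the sample count of each 30-second window (hypothetical helper).

    ASSUMPTION: audio longer than the cap is truncated rather than rejected.
    """
    capped = min(num_samples, MAX_SECONDS * SAMPLING_RATE)
    window = WINDOW_SECONDS * SAMPLING_RATE
    num_windows = max(1, math.ceil(capped / window))
    return [min(window, capped - i * window) for i in range(num_windows)]


def count_sound_tokens(num_samples: int) -> int:
    """Total number of <sound> placeholders needed for one audio clip."""
    total = 0
    for window_samples in split_into_windows(num_samples):
        # ASSUMPTION: mel_len is taken from the unpadded window length.
        mel_len = window_samples // HOP_LENGTH
        total += audio_tokens_for_mel_len(mel_len)
    return total
```

Under these assumptions a full 30-second window has 3000 mel frames and expands to 750 <sound> tokens, so the number of placeholders in the prompt matches the number of embeddings the encoder returns for that window.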
Summary
Add support for NVIDIA's Audio-Flamingo-3 (AF3), a state-of-the-art Large Audio-Language Model for speech, sound, and music understanding.
Changes
- AudioFlamingo3Plugin class for audio feature extraction and processing
- audio_flamingo_3 chat template with ChatML-style format
- multimodal=True
Supported Models
- nvidia/audio-flamingo-3-hf
Key Features
- <audio> → <sound> for dataset compatibility
- input_features_mask handling
Model Specifications
Test plan
- trust_remote_code=True
- make style && make quality)
References