# Speech and audio papers@Top Conference

Hi there! If you find this repository useful, please give it a star ⭐. If you want to add papers, feel free to open a PR 👆 or email 📧 me ([email protected]).

🔥 NEW UPDATE: 31 Jan, 2025. Happy New Year!

🎉 [01/23/2025] Added ICLR 2025 conference papers!

🎉 [01/23/2025] Added ICLR 2024 conference papers!

🎉 [01/29/2025] Added ICML 2024 conference papers!

🎉 [01/29/2025] Added NeurIPS 2024 conference papers!

🎉 [01/30/2025] Added ICML 2023 conference papers!

🎉 [01/30/2025] Added NeurIPS 2023 conference papers!

🎉 [01/30/2025] Added ACMMM 2024 conference papers!

🎉 [01/30/2025] Added ICLR 2023 conference papers!

🎉 [01/30/2025] Added AAAI 2024 conference papers!

🎉 [01/31/2025] Added ACL 2024 conference papers!

🎉 [01/31/2025] Added EMNLP 2024 conference papers!


## ICLR'25

ICLR'25 total submissions: 11,672; accepted: 3,706 (31.75%)

### Speech

This list includes speech papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 100+ speech papers at ICLR'25 in total; we select 49 of them.

In the Status column, "con" denotes accepted conditional on ethics review, and a digit string such as 5668 lists the individual ratings 5, 6, 6, 8.
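
If you want to expand these compact rating strings programmatically, below is a minimal Python sketch (the helper names `parse_ratings` and `average` are ours, not part of this repository). It reads a "1" immediately followed by "0" as the single score 10, so strings like 35810 parse as 3, 5, 8, 10:

```python
# Minimal sketch: expand a compact rating string into individual scores
# and recompute the average shown in the tables below.
# Assumes ICLR-style scores (1, 3, 5, 6, 8, 10), so "10" is the only
# case where two characters form one score.

def parse_ratings(compact: str) -> list[int]:
    scores, i = [], 0
    while i < len(compact):
        if compact[i] == "1" and i + 1 < len(compact) and compact[i + 1] == "0":
            scores.append(10)  # "1" followed by "0" is the single score 10
            i += 2
        else:
            scores.append(int(compact[i]))
            i += 1
    return scores

def average(compact: str) -> float:
    scores = parse_ratings(compact)
    return round(sum(scores) / len(scores), 2)

print(parse_ratings("5668"), average("5668"))    # [5, 6, 6, 8] 6.25
print(parse_ratings("35810"), average("35810"))  # [3, 5, 8, 10] 6.5
```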

| Paper | Status | Average rate |
|---|---|---|
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | con | 8.50 |
| Co$^{\mathbf{3}}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion | | 7.50 |
| Scaling Transformers for Low-Bitrate High-Quality Speech Coding | | 7.00 |
| Context-aware Dynamic Pruning for Speech Foundation Models | | 7.00 |
| Scaling Speech-Text Pre-training with Synthetic Interleaved Data | con | 7.00 |
| CR-CTC: Consistency regularization on CTC for improved speech recognition | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive Speech Recognition | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | | 6.67 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | | 6.50 |
| Objective Soups: Multilingual Multi-Task Acoustic Modeling for Automatic Speech Recognition | not accepted, but the ratings are good | 6.50 |
| SyllableLM: Learning Coarse Semantic Units for Speech Language Models | | 6.50 |
| Improving Semantic Understanding in Speech Language Models via Brain-tuning | | 6.50 |
| SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios | | 6.50 |
| Bridging the Data Provenance Gap Across Text, Speech, and Video | | 6.50 |
| HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | | 6.40 |
| DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors | | 6.25 |
| T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning | | 6.25 |
| VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation | | 6.25 |
| GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling | | 6.00 |
| UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | | 6.00 |
| FIRING-Net: A filtered feature recycling network for speech enhancement | | 6.00 |
| TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation | | 5.83 |
| NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data | 55568, rejected | 5.80 |
| Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | 5666, rejected | 5.75 |
| VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication | 5666, rejected | 5.75 |
| Speech Robust Bench: A Robustness Benchmark For Speech Recognition | 5666, accepted | 5.75 |
| OTTC: A differentiable alignment approach to automatic speech recognition | 368, rejected | 5.68 |
| SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Toward Cutting-Edge Speech Generation Methods | 566, rejected | 5.67 |
| Realistic-Gesture: Co-Speech Gesture Video Generation through Semantic-aware Gesture Representation | 35668, rejected | 5.60 |
| A$^2$-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching | 3568, rejected | 5.50 |
| Representing speech through autoregressive prediction of cochlear tokens | 5566, rejected | 5.50 |
| F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | 3568, rejected, but has big influence! | 5.50 |
| ASROB: Measuring Automatic Speech Recognition from One Book | 3568, rejected | 5.50 |
| SSR: Alignment-Aware Modality Connector for Speech Language Models | 3568, rejected | 5.50 |
| A Variational Approach for Generative Speech Language Modeling | 3568, rejected | 5.50 |
| SPARQ: Outlier-free SpeechLM with Fast Adaptation and Robust Quantization | 5566, rejected | 5.50 |
| Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | 3568, accepted | 5.50 |
| Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback | 3568, rejected | 5.50 |
| Time-Accurate Speech Rich Transcription with Non-Fluencies | 5566, withdrawn | 5.50 |
| dMel: Speech Tokenization Made Simple | 35568, rejected | 5.40 |
| Orator: LLM-Guided Multi-Shot Speech Video Generation | 35568, rejected | 5.40 |
| MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | 3666, accepted, has big influence! | 5.25 |
| Strategic Filtering for Content Moderation: Free Speech or Free of Distortion? | 5556, rejected | 5.25 |
| ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control | 35558, withdrawn | 5.20 |
| VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | 3368, rejected | 5.00 |

### Audio

This list includes audio papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 70+ audio papers at ICLR'25 in total; we select 36 of them.

| Paper | Status | Average rate |
|---|---|---|
| Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency | con | 8.00 |
| CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation | | 7.60 |
| $\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics | | 7.50 |
| ADIFF: Explaining audio difference using natural language | | 7.50 |
| Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | | 7.50 |
| FlowDec: A flow-based full-band general audio codec with high perceptual quality | | 7.00 |
| I Can Hear You: Selective Robust Training for Deepfake Audio Detection | con | 7.00 |
| SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes | | 7.00 |
| RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction | | 6.80 |
| Enhancing Deception Detection with Cognitive Load Features: An Audio-Visual Approach | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Fugatto 1: Foundational Generative Audio Transformer Opus 1 | | 6.75 |
| WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | 35810 | 6.50 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| ViSAGe: Video-to-Spatial Audio Generation | | 6.40 |
| Aligned Better, Listen Better For Audio-Visual Large Language Models | | 6.25 |
| Contrastive Learning from Synthetic Audio Doppelgängers | | 6.25 |
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | | 6.20 |
| Elucidating the Design Space of Text-to-Audio Models | 5568, rejected | 6.00 |
| Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation | | 6.00 |
| Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation | | 6.00 |
| Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives | | 6.00 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | | 5.80 |
| Active Audio Cancellation with Multi-Band Mamba Network | 3668, rejected | 5.75 |
| The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio | 5666, rejected | 5.75 |
| Token Pruning Meets Audio: Investigating Unique Behaviors in Vision Transformer-Based Audio Classification | 55666, accepted | 5.60 |
| AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models | 3388, accepted | 5.50 |
| NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics | 358, accepted | 5.33 |
| Taming Data and Transformers for Audio Generation | 3666, rejected | 5.25 |
| AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation | 5556, rejected | 5.25 |
| Segment, Associate, and Classify: Decoupled Audio-Visual Segmentation Framework | 5556, withdrawn | 5.25 |
| Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI | 3558, rejected | 5.25 |
| Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation | 35558, withdrawn | 5.20 |
| T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback | 3566, withdrawn | 5.00 |

### Summary

Acceptance mostly tracks the ratings. Ratings in the speech/audio track are not high, much lower than in tracks like CV and NLP. Rebuttals are very important!!!

## ICLR'24

### Speech

This list includes speech papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 50+ speech papers at ICLR'24 in total; we select 20+ of them.

| Paper | Status | Average rate |
|---|---|---|
| NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | Spotlight | 8.00 |
| Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | Spotlight | 8.00 |
| Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction | Spotlight | 8.00 |
| Zipformer: A faster and better encoder for automatic speech recognition | Oral | 7.50 |
| RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation | | 7.50 |
| Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech | | 7.00 |
| Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | | 6.75 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | | 6.67 |
| It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | | 6.60 |
| Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis | | 6.50 |
| CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | | 6.40 |
| BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing | 5668, rejected; link: https://arxiv.org/pdf/2309.00916 | 6.25 |
| TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation | 5668, desk rejected; accepted by ACL 2024: https://aclanthology.org/2024.findings-acl.593.pdf | 6.25 |
| Multilingual Visual Speech Recognition with a Single Model using Visual Speech Unit | 56668, rejected; link: https://arxiv.org/pdf/2401.09802v1 | 6.20 |
| PromptTTS 2: Describing and Generating Voices with Text Prompt | | 6.00 |
| Separate and Diffuse: Using a Pretrained Diffusion Model for Better Source Separation | | 6.00 |
| PolyVoice: Language Models for Speech to Speech Translation | 3588, accepted | 6.00 |
| DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models | 5568, rejected; accepted by SIGGRAPH 2024 (Journal Track): https://arxiv.org/pdf/2310.00434 | 6.00 |
| LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading | 5568, accepted | 6.00 |
| Generative Pre-training for Speech with Flow Matching | 3668, accepted | 5.75 |
| DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation | 5666, accepted | 5.75 |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models | 3668, accepted | 5.75 |
| SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding | 3568, rejected; accepted by Interspeech 2024: https://arxiv.org/pdf/2307.07421 | 5.75 |
| RepCodec: A Speech Representation Codec for Speech Tokenization | 5566, rejected; accepted by ACL 2024 main: https://arxiv.org/pdf/2309.00169 | 5.50 |
| A Discrete and Variational Approach to Speech Representation Learning | 33588, withdrawn | 5.40 |
| Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer | 5556, rejected; accepted by ACL 2024: https://arxiv.org/pdf/2406.00976 | 5.25 |

### Audio

This list includes audio papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 20+ audio papers at ICLR'24 in total; we select 17 of them.

| Paper | Status | Average rate |
|---|---|---|
| Masked Audio Generation using a Single Non-Autoregressive Transformer | | 7.33 |
| Listen, Think, and Understand | | 7.00 |
| Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | | 6.67 |
| Weakly-supervised Audio Separation via Bi-modal Semantic Similarity | | 6.67 |
| CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | | 6.50 |
| Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | | 6.00 |
| Lifelong Audio-video Masked Autoencoder with Forget-robust Localized Alignments | 55558, rejected | 5.60 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | 5566, rejected | 5.50 |
| SoundStorm: Efficient Parallel Audio Generation | 35568, rejected | 5.40 |
| Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues | 3666, rejected | 5.25 |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | 3666, rejected | 5.25 |
| UniAudio: An Audio Foundation Model Toward Universal Audio Generation | 15510, rejected; accepted by ICML 2024 | 5.25 |
| Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners | 3666, rejected | 5.25 |
| SMILE: Audio-Visual Speech Recognition with Siamese Masked Interaction Learning | 5555, rejected | 5.00 |
| Leveraging characteristics of the output distribution for identifying adversarial audio examples | 5555, rejected | 5.00 |
| Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition | 5555, rejected | 5.00 |
| WavJourney: Compositional Audio Creation with Large Language Models | 35566, rejected | 5.00 |

### Summary

This year, the number of papers is not very large.

## ICML'24

### Speech

| Paper | Status |
|---|---|
| ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis | link |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | link |
| NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | link |
| InstructSpeech: Following Speech Editing Instructions via Large Language Models | link |
| Scaling Speech Technology to 1,000+ Languages | link |
| IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation | link |
| Speech Self-Supervised Learning Using Diffusion Model Synthetic Data | link |

### Audio

| Paper | Status |
|---|---|
| Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion | link |
| UniAudio: Towards Universal Audio Generation with Large Language Models | link |
| Prompt-guided Precise Audio Editing with Diffusion Models | |
| Creative Text-to-Audio Generation via Synthesizer Programming | |
| Fast Timing-Conditioned Latent Audio Diffusion | |
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | |
| Listenable Maps for Audio Classifiers | |
| STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment | |
| From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation | |
| AND: Audio Network Dissection for Interpreting Deep Acoustic Models | |
| EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | |

## NeurIPS'24

### Speech

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=speech
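
The lists below come from keyword searches over paper titles, as in the link above. A minimal Python sketch of the same case-insensitive filtering (the helper name and sample titles are illustrative, not from this repository):

```python
# Minimal sketch of the title filtering behind the "useful link" above:
# keep every paper whose title contains the keyword, case-insensitively.

def filter_titles(titles: list[str], keyword: str) -> list[str]:
    kw = keyword.lower()
    return [title for title in titles if kw in title.lower()]

sample_titles = [
    "SSDM: Scalable Speech Dysfluency Modeling",
    "Learning Spatially-Aware Language and Audio Embeddings",
    "How to Scale Your EMA",
]
print(filter_titles(sample_titles, "speech"))
# -> ['SSDM: Scalable Speech Dysfluency Modeling']
```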

- SSDM: Scalable Speech Dysfluency Modeling
- SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
- Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
- A Full-duplex Speech Dialogue Scheme Based On Large Language Model
- CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
- DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation
- SILENCE: Protecting privacy in offloaded speech understanding on resource-constrained devices
- FINALLY: fast and universal speech enhancement with studio-like quality
- SpeechAlign: Aligning Speech Generation to Human Preferences
- Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals
- SCOREQ: Speech Quality Assessment with Contrastive Regression
- RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
- CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
- IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
- Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation
- Comprehensive Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for the Polish Language

### Audio

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=audio

- Vocal Call Locator Benchmark (VCL'24) for localizing rodent vocalizations from multi-channel audio
- SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
- Tell What You Hear From What You See - Video to Audio Generation Through Text
- Learning Spatially-Aware Language and Audio Embeddings
- Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
- SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
- Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
- Continual Audio-Visual Sound Separation
- Mixtures of Experts for Audio-Visual Learning
- Listenable Maps for Zero-Shot Audio Classifiers
- Aligning Audio-Visual Joint Representations with an Agentic Workflow
- AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
- An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
- A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
- VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
- UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
- AudioMarkBench: Benchmarking Robustness of Audio Watermarking

## ICML'23

### Speech

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=speech

| Paper | Status |
|---|---|
| Pre-training for Speech Translation: CTC Meets Optimal Transport | Oral |
| Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language | Oral |
| Robust Speech Recognition via Large-Scale Weak Supervision | |
| Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation | |
| Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations | |
| MetricGAN-OKD: Multi-Metric Optimization of MetricGAN via Online Knowledge Distillation for Speech Enhancement | |
| Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models | |

### Audio

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=audio

| Paper | Status |
|---|---|
| AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | |
| Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | |
| A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition | |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Oral |
| Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection | |

## NeurIPS'23

### Speech

| Paper | Status |
|---|---|
| High-Fidelity Audio Compression with Improved RVQGAN | Spotlight |
| Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio | Spotlight |
| How to Scale Your EMA | Spotlight |
| Textually Pretrained Speech Language Models | |
| ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation | |
| DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation | |
| StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models | |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | |
| DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement | |
| P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting | |
| DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning | |
| Parts of Speech–Grounded Subspaces in Vision-Language Models | |
| UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures | |
| Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer | |
| Disentangling Voice and Content with Self-Supervision for Speaker Recognition | |
| From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion | |
| Unified Segment-to-Segment Framework for Simultaneous Sequence Generation | |
| Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference | |
| Progressive Ensemble Distillation: Building Ensembles for Efficient Inference | |
| LEACE: Perfect linear concept erasure in closed form | |
| TART: A plug-and-play Transformer module for task-agnostic reasoning | |

### Audio

| Paper | Status |
|---|---|
| Compression with Bayesian Implicit Neural Representations | Spotlight |
| From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion | |
| Pengi: An Audio Language Model for Audio Tasks | |
| AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | |
| MAViL: Masked Audio-Video Learners | |
| Weakly-Supervised Audio-Visual Segmentation | |
| Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning | |
| Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models | |
| AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis | |
| Simple and Controllable Music Generation | |
| CoLLAT: On Adding Fine-grained Audio Understanding to Language Models using Token-Level Locked-Language Tuning | |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser | |
| Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization | |
| Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks | |
| Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective | |
| Self-Supervised Visual Acoustic Matching | |
| Connecting Multi-modal Contrastive Representations | |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | |
| Achieving Cross Modal Generalization with Multimodal Unified Representation | |
| Any-to-Any Generation via Composable Diffusion | |
| Efficient Neural Music Generation | |
| Training Transitive and Commutative Multimodal Transformers with LoReTTa | |
| Latent Diffusion for Language Generation | |
| Block-State Transformers | |
| Learning Interpretable Low-dimensional Representation via Physical Symmetry | |
| Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | |
| Feature Dropout: Revisiting the Role of Augmentations in Contrastive Learning | |
| Language Semantic Graph Guided Data-Efficient Learning | |

## ACMMM'24

### Speech

| Paper | Status |
|---|---|
| VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling | Oral |
| UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis | Oral |
| Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts | Oral |
| ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations | Oral |
| Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation | Oral |
| Generative Expressive Conversational Speech Synthesis | |
| SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description | |
| CIEASR: Contextual Image-Enhanced Automatic Speech Recognition for Improved Homophone Discrimination | |
| EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation | |
| DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling | |
| Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis | |
| RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues | |
| Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation | |
| Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation | |
| MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation | |
| SpeechEE: A Novel Benchmark for Speech Event Extraction | |
| MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion | |
| Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation | |
| Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation | |
| FlashSpeech: Efficient Zero-Shot Speech Synthesis | |

### Audio

| Paper | Status |
|---|---|
| OpenAVE: Moving towards Open Set Audio-Visual Event Localization | Oral |
| Unveiling and Mitigating Bias in Audio Visual Segmentation | Oral |
| AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset | Oral |
| Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization | Oral |
| Towards Trustworthy MetaShopping: Studying Manipulative Audiovisual Designs in Virtual-Physical Commercial Platforms | Oral |
| Open-Vocabulary Audio-Visual Semantic Segmentation | Oral |
| Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training | Oral |
| Toward Explainable Physical Audiovisual Commonsense Reasoning | Oral |
| TiVA: Time-Aligned Video-to-Audio Generation | Oral |
| Coarse-to-Fine Proposal Refinement Framework For Audio Temporal Forgery Detection and Localization | Oral |
| SelM: Selective Mechanism based Audio-Visual Segmentation | Oral |
| Dissecting Temporal Understanding in Text-to-Audio Retrieval | |
| FRADE: Forgery-aware Audio-distilled Multimodal Learning for Deepfake Detection | |
| AMG-Embedding: a Self-Supervised Embedding Approach for Audio Identification | |
| MMAL: Multi-Modal Analytic Learning for Exemplar-Free Audio-Visual Class Incremental Tasks | |
| Utilizing Speaker Profiles for Impersonation Audio Detection | |
| CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization | |
| CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks | |
| Time-Frequency Domain Fusion Enhancement for Audio Super-Resolution | |
| Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning | |
| Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition | |
| Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier | |
| AVHash: Joint Audio-Visual Hashing for Video Retrieval | |
| RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues | |
| EchoAudio: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps | |
| Instance-Level Panoramic Audio-Visual Saliency Detection and Ranking | |
| Audio-Driven Identity Manipulation for Face Inpainting | |
| GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis | |
| TAS: Personalized Text-guided Audio Spatialization | |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | |
| V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection | |

## ICLR'23

### Speech

- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
- An efficient encoder-decoder architecture with top-down attention for speech separation
- Jointly Learning Visual and Auditory Speech Representations from Raw Data
- Bag of Tricks for Unsupervised Text-to-Speech
- In-Situ Text-Only Adaptation of Speech Models with Low-Overhead Speech Imputations
- Revisiting the Entropy Semiring for Neural Speech Recognition
- D4AM: A General Denoising Framework for Downstream Acoustic Models
- Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training
- Continuous pseudo-labeling from the start
- NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis
- Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks

### Audio

| Paper | Status |
|---|---|
| Token Merging: Your ViT But Faster | Oral |
| Contrastive Audio-Visual Masked Autoencoder | Spotlight |
| AudioGen: Textually Guided Audio Generation | |
| Defending against Adversarial Audio via Diffusion Model | |
| wav2tok: Deep Sequence Tokenizer for Audio Retrieval | |
| Continual Transformers: Redundancy-Free Attention for Online Inference | |
| GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis | |
| Words are all you need? Language as an approximation for human similarity judgments | |

## AAAI'24

useful links: https://aaai.org/wp-content/uploads/2024/02/AAAI-24_Main_2024-02-01.pdf

https://github.com/DmitryRyumin/AAAI-2024-Papers

### Speech

| Paper | Link |
|---|---|
| Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation | https://arxiv.org/abs/2312.10877 |
| UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding | https://arxiv.org/abs/2306.07547 |
| Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation | https://arxiv.org/abs/2401.03468 |
| Visual Hallucination Elevates Speech Recognition | https://ojs.aaai.org/index.php/AAAI/article/view/29926 |
| Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales | https://ojs.aaai.org/index.php/AAAI/article/view/29743 |
| Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition | https://ojs.aaai.org/index.php/AAAI/article/view/29882 |
| MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis | https://arxiv.org/abs/2312.10687 |
| Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling | https://arxiv.org/abs/2312.11947 |
| Let There Be Sound: Reconstructing High Quality Speech from Silent Videos | https://arxiv.org/abs/2308.15256 |
| Divergence-Guided Simultaneous Speech Translation | https://ojs.aaai.org/index.php/AAAI/article/view/29733 |
| SECap: Speech Emotion Captioning with Large Language Model | https://arxiv.org/abs/2312.10381 |
| Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction | https://arxiv.org/abs/2312.10305 |

### Audio

| Paper | Link |
|---|---|
| AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis | https://arxiv.org/abs/2312.10921 |
| V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models | https://arxiv.org/abs/2308.09300 |
| What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection | https://arxiv.org/abs/2312.09651 |
| Audio Generation with Multiple Conditional Diffusion Model | https://arxiv.org/abs/2308.11940 |
| AVSegFormer: Audio-Visual Segmentation with Transformer | https://ojs.aaai.org/index.php/AAAI/article/view/29104 |
| Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation | https://arxiv.org/abs/2309.16429 |
| Sample-Constrained Black Box Optimization for Audio Personalization | https://ojs.aaai.org/index.php/AAAI/article/view/28881 |
| DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification | https://ojs.aaai.org/index.php/AAAI/article/view/29716 |
| CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments | https://arxiv.org/abs/2306.04047 |
| Learning Temporal Resolution in Spectrogram for Audio Classification | https://arxiv.org/abs/2210.01719 |
| SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network | https://arxiv.org/abs/2312.16149 |
| Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation | https://arxiv.org/abs/2312.08673 |
| Improving Audio-Visual Segmentation with Bidirectional Generation | https://arxiv.org/abs/2308.08288 |
| Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification | https://ojs.aaai.org/index.php/AAAI/article/view/29015 |
| Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering | https://arxiv.org/abs/2312.12816 |
| Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer | https://arxiv.org/abs/2309.07929 |

## ACL'24

useful link: https://2024.aclweb.org/program/main_conference_papers/#long-papers

### Speech

| Paper | Authors | Status |
|---|---|---|
| GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators | Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, EngSiong Chng | Long, link |
| Wav2Gloss: Generating Interlinear Glossed Text from Speech | Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel Romney Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R Mortensen, Lori Levin | https://aclanthology.org/2024.acl-long.34.pdf |
| A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation | Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min zhang | https://aclanthology.org/2024.acl-long.85.pdf |
| Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer | Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu | https://aclanthology.org/2024.acl-long.97.pdf |
| Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? | Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.789.pdf |
| StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection | Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.202.pdf |
| Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? | Roshan Sharma, Suwon Shon, Mark Lindsey, Hira Dhamyal, Bhiksha Raj | https://aclanthology.org/2024.acl-long.790.pdf |
| LLM Knows Body Language, Too: Translating Speech Voices into Human Gestures | Chenghao Xu, Guangtao Lyu, Jiexi Yan, Muli Yang, Cheng Deng | https://aclanthology.org/2024.acl-long.273.pdf |
| RepCodec: A Speech Representation Codec for Speech Tokenization | Zhichao Huang, Chutong Meng, Tom Ko | https://aclanthology.org/2024.acl-long.314.pdf |
| Error-preserving Automatic Speech Recognition of Young English Learners’ Language | Janick Michot, Manuela Hürlimann, Jan Milan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak | https://aclanthology.org/2024.acl-long.348.pdf |
| Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min zhang, Yang Feng | https://aclanthology.org/2024.acl-long.392.pdf |
| Multimodal Contextualized Semantic Parsing from Speech | Jordan Voas, David Harwath, Ray Mooney | https://aclanthology.org/2024.acl-long.398.pdf |
| SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network | Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo XU, Guoqi Li | https://aclanthology.org/2024.acl-long.429.pdf |
| Speech Sense Disambiguation: Tackling Homophone Ambiguity in End-to-End Speech Translation | Tengfei Yu, Xuebo Liu, Liang Ding, Kehai Chen, Dacheng Tao, Min Zhang | https://aclanthology.org/2024.acl-long.435.pdf |
| Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation | Keqi Deng, Phil Woodland | https://aclanthology.org/2024.acl-long.448.pdf |
| Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t | Chihiro Taguchi, David Chiang | https://aclanthology.org/2024.acl-long.827.pdf |
| Speech language models lack important brain-relevant semantics | SUBBA REDDY OOTA, Emin Çelik, Fatma Deniz, Mariya Toneva | https://aclanthology.org/2024.acl-long.462.pdf |
| StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min zhang, Yang Feng | https://aclanthology.org/2024.acl-long.485.pdf |
| NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data | Manuel Tonneau, Pedro Vitor Quinta de Castro, Karim Lasri, Ibrahim Sambo Farouq, Lakshmi Subramanian, Victor Orozco-Olvera, Samuel Fraiberger | https://aclanthology.org/2024.acl-long.488v2.pdf |
| Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation | Songju Lei, Xize Cheng, Mengjiao Lyu, Jianqiao Hu, Jintao Tan, Runlin Liu, Lingyu Xiong, Tao Jin, Xiandong Li, Zhou Zhao | https://aclanthology.org/2024.acl-long.543.pdf |
| OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification | Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe | https://aclanthology.org/2024.acl-long.549.pdf |
| Don’t Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection | Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu | https://aclanthology.org/2024.acl-long.652.pdf |
| Structured Tree Alignment for Evaluation of (Speech) Constituency Parsing | Freda Shi, Kevin Gimpel, Karen Livescu | https://aclanthology.org/2024.acl-long.666.pdf |
| VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild | Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath | https://aclanthology.org/2024.acl-long.673.pdf |
| A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech | Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar | https://aclanthology.org/2024.acl-long.684.pdf |
| XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang | https://aclanthology.org/2024.acl-long.697.pdf |
| MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | Shengpeng Ji, Ziyue Jiang, Wang Hanting, Jialung Zuo, Zhou Zhao | https://aclanthology.org/2024.acl-long.733.pdf |
| The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition | Enshi Zhang, Rafael Trujillo, Christian Poellabauer | https://aclanthology.org/2024.acl-long.752.pdf |
| Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech | Adrien Pupier, Maximin Coavoux, Jérôme Goulian, Benjamin Lecouteux | Short, link |
| Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster | Agostina Calabrese, Leonardo Neves, Neil Shah, Maarten W. Bos, Björn Ross, Mirella Lapata, Francesco Barbieri | https://aclanthology.org/2024.acl-short.38.pdf |
| On the Semantic Latent Space of Diffusion-Based Text-To-Speech Models | Miri Varshavsky, Roy Hirsch, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin | https://aclanthology.org/2024.acl-short.24.pdf |

### Audio

| Paper | Authors | Status |
|---|---|---|
| AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou | Long, link |
| StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection | Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.202.pdf |
| M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset | Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang | https://aclanthology.org/2024.acl-long.489.pdf |
| XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang | https://aclanthology.org/2024.acl-long.697.pdf |

## EMNLP'24

useful links: https://2024.emnlp.org/program/accepted_main_conference/

https://2024.emnlp.org/program/accepted_findings/

### Speech

| Paper | Authors | Status |
|---|---|---|
| When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection | Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps | Main, link |
| Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model | Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia Perera, EngSiong Chng, Lina Yao | https://aclanthology.org/2024.emnlp-main.9.pdf |
| Scaling Properties of Speech Language Models | Santiago Cuervo, Ricard Marxer | https://aclanthology.org/2024.emnlp-main.21.pdf |
| EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models | Maureen de Seyssel, Antony D’Avirro, Adina Williams, Emmanuel Dupoux | https://aclanthology.org/2024.emnlp-main.30.pdf |
| Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering | Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini | https://aclanthology.org/2024.emnlp-main.201.pdf |
| AlignCap: Aligning Speech Emotion Captioning to Human Preferences | Ziqi Liang, Haoxiang Shi, Hanhui Chen | https://aclanthology.org/2024.emnlp-main.224.pdf |
| F$^2$RL: Factuality and Faithfulness Reinforcement Learning Framework for Claim-Guided Evidence-Supported Counterspeech Generation | Haiyang Wang, Yuchen Pan, Xin Song, Xuechen Zhao, Minghao Hu, Bin Zhou | https://aclanthology.org/2024.emnlp-main.255.pdf |
| Outcome-Constrained Large Language Models for Countering Hate Speech | Lingzi Hong, Pengcheng Luo, Eduardo Blanco, Xiaoying Song | https://aclanthology.org/2024.emnlp-main.260.pdf |
| On Mitigating Performance Disparities in Multilingual Speech Recognition | Monorama Swain, Anna Katrine van Zee, Anders Søgaard | https://aclanthology.org/2024.emnlp-main.323.pdf |
| Methods of Automatic Matrix Language Determination for Code-Switched Speech | Olga Iakovenko, Thomas Hain | https://aclanthology.org/2024.emnlp-main.330.pdf |
| EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning | Ashish Seth, Ramaneswaran S, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha | https://aclanthology.org/2024.emnlp-main.366.pdf |
| Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models | Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales | https://aclanthology.org/2024.emnlp-main.430.pdf |
| Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning | Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee | https://aclanthology.org/2024.emnlp-main.445.pdf |
| Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition | Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-yi Lee | https://aclanthology.org/2024.emnlp-main.503.pdf |
| ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers | Yuzhe Gu, Enmao Diao | https://aclanthology.org/2024.emnlp-main.562.pdf |
| Towards Robust Speech Representation Learning for Thousands of Languages | William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe | https://aclanthology.org/2024.emnlp-main.570.pdf |
| Speechworthy Instruction-tuned Language Models | Hyundong Justin Cho, Nicolaas Paul Jedema, Leonardo F. R. Ribeiro, Karishma Sharma, Pedro Szekely, Alessandro Moschitti, Ruben Janssen, Jonathan May | https://aclanthology.org/2024.emnlp-main.595.pdf |
| Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights | Hao Yang, Lizhen Qu, Ehsan Shareghi, Reza Haf | https://aclanthology.org/2024.emnlp-main.614.pdf |
| Integrating Argumentation and Hate-Speech-based Techniques for Countering Misinformation | Sougata Saha, Rohini Srihari | https://aclanthology.org/2024.emnlp-main.622.pdf |
| Unveiling the Role of Pretraining in Direct Speech Translation | Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà | https://aclanthology.org/2024.emnlp-main.630.pdf |
| Multi-Level Cross-Modal Alignment for Speech Relation Extraction | Liang Zhang, Zhen Yang, Biao Fu, Ziyao Lu, Liangying Shao, Shiyu Liu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jinsong Su | https://aclanthology.org/2024.emnlp-main.668.pdf |
| Self-Powered LLM Modality Expansion for Large Speech-Text Models | Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang | https://aclanthology.org/2024.emnlp-main.690.pdf |
| Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach | Siqi Li, Danni Liu, Jan Niehues | https://aclanthology.org/2024.emnlp-main.708.pdf |
| Towards an Open-Source Speech Foundation Model for EU: 950,000 Hours of Open-Source Compliant Speech Data for EU Languages | Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri | https://aclanthology.org/2024.emnlp-main.771.pdf |
| VHASR: A Multimodal Speech Recognition System With Vision Hotwords | Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, hai zhao | https://aclanthology.org/2024.emnlp-main.821.pdf |
| AudioVSR: Enhancing Video Speech Recognition with Audio Data | Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin | https://aclanthology.org/2024.emnlp-main.858.pdf |
| Hate Personified: Investigating the role of LLMs in content moderation pipeline for hate speech | Sarah Masud, Sahajpreet Singh, Viktor Hangya, Alexander Fraser, Tanmoy Chakraborty | https://aclanthology.org/2024.emnlp-main.886.pdf |
| Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification | Esra Dönmez, Thang Vu, Agnieszka Falenska | https://aclanthology.org/2024.emnlp-main.1019.pdf |
| BLSP-Emo: Towards Empathetic Large Speech-Language Models | Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang | https://aclanthology.org/2024.emnlp-main.1070.pdf |
| Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection | Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, Sara Tonelli | |
| Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech | Guan-Ting Lin, Wei Ping Huang, Hung-yi Lee | |
| Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding | YeonJoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, seung-won hwang | |
| Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities | Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang | |
| PREDICT: Multi-Agent-based Debate Simulation for Generalized Hate Speech Detection | Someen Park, Jaehoon Kim, Seungwan Jin, Sohyun Park, Kyungsik Han | |
| TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR | Shashi Kumar, Srikanth Madikeri, Juan Pablo Zuluaga Gomez, Iuliia Thorbecke, Esaú VILLATORO-TELLO, Sergio Burdisso, Petr Motlicek, Karthik Pandia D S, Aravind Ganapathiraju | |
| Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps | Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, Dirk Hovy | |
| Casablanca: Data and Models for Multidialectal Arabic Speech Recognition | Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, et al. | |
| SpeechQE: Estimating the Quality of Direct Speech Translation | HyoJung Han, Kevin Duh, Marine Carpuat | |
| Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model | Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe | |
| Is Child-Directed Speech Effective Training Data for Language Models? | Steven Y. Feng, Noah Goodman, Michael Frank | |
| HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models | Huy Nghiem, Hal Daumé III | Findings |
| PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition | Karima Kadaoui, Maryam Al Ali, Hawau Olamide Toyin, Ibrahim Mohammed, Hanan Aldarmaki | |
| STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki | |
| Contextualized Graph Representations for Generating Counter-Narrative against Hate Speech | Selene Baez Santamaria, Helena Gomez Adorno, Ilia Markov | |
| LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models | Zuhair hasan shaik, Pradyoth Hegde, Prashant Bannulmath, Deepak K T | |
| MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo | |
| Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | Jeonghun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro | |
| Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation | G M Shahariar, Jia Chen, Jiachen Li, Yue Dong | |
| Breaking the Boundaries: A Unified Framework for Chinese Named Entity Recognition Across Text and Speech | Jinzhong Ning, Yuanyuan Sun, Bo Xu, Zhihao Yang, Ling Luo, Hongfei Lin | |
| Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim, Yejin Jeon, Gary Lee | |
| Modeling Gender and Dialect Bias in Automatic Speech Recognition | Camille Harris, Chijioke Mgbahurike, Neha Kumar, Diyi Yang | |
| LLM generated responses to mitigate the impact of hate speech | Jakub Podolak, Szymon Łukasik, Paweł Balawender, Jan Ossowski, Jan Piotrowski, Katarzyna Bąkowicz, Piotr Sankowski | |
| BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation | David Dale, Marta R. Costa-jussà | |
| Textless Speech-to-Speech Translation With Limited Parallel Data | Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi | |
| PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems | Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada | |
| Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Onkar Kishor Susladkar, Vishesh Tripathi, Biddwan Ahmed | |
| Recent Advances in Online Hate Speech Moderation: Multimodality and the Role of Large Models | Ming Shan Hee, Shivam Sharma, RUI CAO, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, Roy Ka-Wei Lee | |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model | Shujie HU, Long Zhou, Shujie LIU, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei | |

### Audio

| Paper | Authors | Status |
|---|---|---|
| IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding | Pengcheng Li, Xulong Zhang, Jing Xiao, Jianzong Wang | Main |
| Cross-Domain Audio Deepfake Detection: Dataset and Analysis | Yuang Li, Min Zhang, Mengxin Ren, Xiaosong Qiao, Miaomiao Ma, Daimeng Wei, Hao Yang | |
| GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha | |
| OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation | Tanvir Mahmud, Diana Marculescu | |
| AudioVSR: Enhancing Video Speech Recognition with Audio Data | Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin | |
| PALM: Few-Shot Prompt Learning for Audio Language Models | Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki | |
| Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models | Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D’Haro, Robby T. Tan, Haizhou Li | Findings |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon | |
| Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Review | Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha | |
| Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim, Yejin Jeon, Gary Lee | |
| SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering | Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang | |
| PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain | Jianyi Chen, Zheqi DAI, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, Wei Xue | |

## Useful Survey & Awesome Links

  1. Amphion v0.2 technical report: https://arxiv.org/abs/2501.15442

  2. Emilia-Large: a larger release, with more experimental results and details: https://arxiv.org/abs/2501.15907

  3. AnyEnhance: one model that handles speech enhancement, singing voice enhancement, speaker extraction, and more: https://arxiv.org/abs/2501.15417
