# Speech and audio papers@Top Conference

Hi there! If you find this repository useful, please give it a star ⭐. If you want to add papers, feel free to open a PR 👆 or email 📧 me ([email protected]).

🔥 NEW UPDATE: 31 Jan, 2025. Happy New Year!

🎉 [01/23/2025] Added ICLR 2025 conference papers!

🎉 [01/23/2025] Added ICLR 2024 conference papers!

🎉 [01/29/2025] Added ICML 2024 conference papers!

🎉 [01/29/2025] Added NeurIPS 2024 conference papers!

🎉 [01/30/2025] Added ICML 2023 conference papers!

🎉 [01/30/2025] Added NeurIPS 2023 conference papers!

🎉 [01/30/2025] Added ACMMM 2024 conference papers!

🎉 [01/30/2025] Added ICLR 2023 conference papers!

🎉 [01/30/2025] Added AAAI 2024 conference papers!

🎉 [01/31/2025] Added ACL 2024 conference papers!

🎉 [01/31/2025] Added EMNLP 2024 conference papers!


## ICLR'25

ICLR'25 total submissions: 11,672; accepted: 3,706 (31.75%)

### Speech

This list includes speech papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 100+ speech papers at ICLR'25 in total; we select 49 of them.

In the Status column, "con" denotes accepted conditional on ethics review, and a digit string such as 5668 lists the individual ratings 5, 6, 6, 8.
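
If you want to expand these compact rating strings programmatically, below is a minimal Python sketch (the helper names `parse_ratings` and `average` are ours, not part of this repository). It reads a "1" immediately followed by "0" as the single score 10, so strings like 35810 parse as 3, 5, 8, 10:

```python
# Minimal sketch: expand a compact rating string into individual scores
# and recompute the average shown in the tables below.
# Assumes ICLR-style scores (1, 3, 5, 6, 8, 10), so "10" is the only
# case where two characters form one score.

def parse_ratings(compact: str) -> list[int]:
    scores, i = [], 0
    while i < len(compact):
        if compact[i] == "1" and i + 1 < len(compact) and compact[i + 1] == "0":
            scores.append(10)  # "1" followed by "0" is the single score 10
            i += 2
        else:
            scores.append(int(compact[i]))
            i += 1
    return scores

def average(compact: str) -> float:
    scores = parse_ratings(compact)
    return round(sum(scores) / len(scores), 2)

print(parse_ratings("5668"), average("5668"))    # [5, 6, 6, 8] 6.25
print(parse_ratings("35810"), average("35810"))  # [3, 5, 8, 10] 6.5
```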

| Paper | Status | Average rate |
|---|---|---|
| TANGO: Co-Speech Gesture Video Reenactment with Hierarchical Audio Motion Embedding and Diffusion Interpolation | con | 8.50 |
| Co$^{\mathbf{3}}$Gesture: Towards Coherent Concurrent Co-speech 3D Gesture Generation with Interactive Diffusion | | 7.50 |
| Scaling Transformers for Low-Bitrate High-Quality Speech Coding | | 7.00 |
| Context-aware Dynamic Pruning for Speech Foundation Models | | 7.00 |
| Scaling Speech-Text Pre-training with Synthetic Interleaved Data | con | 7.00 |
| CR-CTC: Consistency regularization on CTC for improved speech recognition | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Three-in-One: Fast and Accurate Transducer for Hybrid-Autoregressive Speech Recognition | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis | | 6.67 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| LLaMA-Omni: Seamless Speech Interaction with Large Language Models | | 6.50 |
| Objective Soups: Multilingual Multi-Task Acoustic Modeling for Automatic Speech Recognition | not accepted, but the ratings are good | 6.50 |
| SyllableLM: Learning Coarse Semantic Units for Speech Language Models | | 6.50 |
| Improving Semantic Understanding in Speech Language Models via Brain-tuning | | 6.50 |
| SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios | | 6.50 |
| Bridging the Data Provenance Gap Across Text, Speech, and Video | | 6.50 |
| HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis | | 6.40 |
| DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors | | 6.25 |
| T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning | | 6.25 |
| VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation | | 6.25 |
| GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling | | 6.00 |
| UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation | | 6.00 |
| FIRING-Net: A filtered feature recycling network for speech enhancement | | 6.00 |
| TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation | | 5.83 |
| NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data | 55568, rejected | 5.80 |
| Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis | 5666, rejected | 5.75 |
| VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication | 5666, rejected | 5.75 |
| Speech Robust Bench: A Robustness Benchmark For Speech Recognition | 5666, accepted | 5.75 |
| OTTC: A differentiable alignment approach to automatic speech recognition | 368, rejected | 5.68 |
| SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Toward Cutting-Edge Speech Generation Methods | 566, rejected | 5.67 |
| Realistic-Gesture: Co-Speech Gesture Video Generation through Semantic-aware Gesture Representation | 35668, rejected | 5.60 |
| A$^2$-Flow: Alignment-Aware Pre-training for Speech Synthesis with Flow Matching | 3568, rejected | 5.50 |
| Representing speech through autoregressive prediction of cochlear tokens | 5566, rejected | 5.50 |
| F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching | 3568, rejected, but has big influence! | 5.50 |
| ASROB: Measuring Automatic Speech Recognition from One Book | 3568, rejected | 5.50 |
| SSR: Alignment-Aware Modality Connector for Speech Language Models | 3568, rejected | 5.50 |
| A Variational Approach for Generative Speech Language Modeling | 3568, rejected | 5.50 |
| SPARQ: Outlier-free SpeechLM with Fast Adaptation and Robust Quantization | 5566, rejected | 5.50 |
| Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement | 3568, accepted | 5.50 |
| Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback | 3568, rejected | 5.50 |
| Time-Accurate Speech Rich Transcription with Non-Fluencies | 5566, withdrawn | 5.50 |
| dMel: Speech Tokenization Made Simple | 35568, rejected | 5.40 |
| Orator: LLM-Guided Multi-Shot Speech Video Generation | 35568, rejected | 5.40 |
| MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer | 3666, accepted, has big influence! | 5.25 |
| Strategic Filtering for Content Moderation: Free Speech or Free of Distortion? | 5556, rejected | 5.25 |
| ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control | 35558, withdrawn | 5.20 |
| VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers | 3368, rejected | 5.00 |

### Audio

This list includes audio papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 70+ audio papers at ICLR'25 in total; we select 36 of them.

| Paper | Status | Average rate |
|---|---|---|
| Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency | con | 8.00 |
| CyberHost: A One-stage Diffusion Framework for Audio-driven Talking Body Generation | | 7.60 |
| $\texttt{BirdSet}$: A Large-Scale Dataset for Audio Classification in Avian Bioacoustics | | 7.50 |
| ADIFF: Explaining audio difference using natural language | | 7.50 |
| Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | | 7.50 |
| FlowDec: A flow-based full-band general audio codec with high perceptual quality | | 7.00 |
| I Can Hear You: Selective Robust Training for Deepfake Audio Detection | con | 7.00 |
| SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes | | 7.00 |
| RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction | | 6.80 |
| Enhancing Deception Detection with Cognitive Load Features: An Audio-Visual Approach | | 6.75 |
| Sylber: Syllabic Embedding Representation of Speech from Raw Audio | | 6.75 |
| Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data | | 6.75 |
| Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation | | 6.75 |
| Audio Large Language Models Can Be Descriptive Speech Quality Evaluators | | 6.75 |
| Fugatto 1: Foundational Generative Audio Transformer Opus 1 | | 6.75 |
| WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling | 35810 | 6.50 |
| EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation | | 6.50 |
| ViSAGe: Video-to-Spatial Audio Generation | | 6.40 |
| Aligned Better, Listen Better For Audio-Visual Large Language Models | | 6.25 |
| Contrastive Learning from Synthetic Audio Doppelgängers | | 6.25 |
| AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models | | 6.20 |
| Elucidating the Design Space of Text-to-Audio Models | 5568, rejected | 6.00 |
| Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation | | 6.00 |
| Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation | | 6.00 |
| Rethinking Audio-Visual Adversarial Vulnerability from Temporal and Modality Perspectives | | 6.00 |
| Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics | | 5.80 |
| Active Audio Cancellation with Multi-Band Mamba Network | 3668, rejected | 5.75 |
| The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio | 5666, rejected | 5.75 |
| Token Pruning Meets Audio: Investigating Unique Behaviors in Vision Transformer-Based Audio Classification | 55666, accepted | 5.60 |
| AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models | 3388, accepted | 5.50 |
| NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics | 358, accepted | 5.33 |
| Taming Data and Transformers for Audio Generation | 3666, rejected | 5.25 |
| AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation | 5556, rejected | 5.25 |
| Segment, Associate, and Classify: Decoupled Audio-Visual Segmentation Framework | 5556, withdrawn | 5.25 |
| Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI | 3558, rejected | 5.25 |
| Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation | 35558, withdrawn | 5.20 |
| T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback | 3566, withdrawn | 5.00 |

### Summary

Acceptance mostly tracks the ratings. Ratings in the speech/audio track are not high, much lower than in tracks like CV and NLP. Rebuttals are very important!!!

## ICLR'24

### Speech

This list includes speech papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 50+ speech papers at ICLR'24 in total; we select 20+ of them.

| Paper | Status | Average rate |
|---|---|---|
| NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers | Spotlight | 8.00 |
| Large Language Models are Efficient Learners of Noise-Robust Speech Recognition | Spotlight | 8.00 |
| Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction | Spotlight | 8.00 |
| Zipformer: A faster and better encoder for automatic speech recognition | Oral | 7.50 |
| RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation | | 7.50 |
| Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech | | 7.00 |
| Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM | | 6.75 |
| SALMONN: Towards Generic Hearing Abilities for Large Language Models | | 6.67 |
| It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition | | 6.60 |
| Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis | | 6.50 |
| CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | | 6.40 |
| BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing | 5668, rejected; link: https://arxiv.org/pdf/2309.00916 | 6.25 |
| TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation | 5668, desk rejected; accepted by ACL 2024: https://aclanthology.org/2024.findings-acl.593.pdf | 6.25 |
| Multilingual Visual Speech Recognition with a Single Model using Visual Speech Unit | 56668, rejected; link: https://arxiv.org/pdf/2401.09802v1 | 6.20 |
| PromptTTS 2: Describing and Generating Voices with Text Prompt | | 6.00 |
| Separate and Diffuse: Using a Pretrained Diffusion Model for Better Source Separation | | 6.00 |
| PolyVoice: Language Models for Speech to Speech Translation | 3588, accepted | 6.00 |
| DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models | 5568, rejected; accepted by SIGGRAPH 2024 (Journal Track): https://arxiv.org/pdf/2310.00434 | 6.00 |
| LipVoicer: Generating Speech from Silent Videos Guided by Lip Reading | 5568, accepted | 6.00 |
| Generative Pre-training for Speech with Flow Matching | 3668, accepted | 5.75 |
| DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation | 5666, accepted | 5.75 |
| SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models | 3668, accepted | 5.75 |
| SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding | 3568, rejected; accepted by Interspeech 2024: https://arxiv.org/pdf/2307.07421 | 5.75 |
| RepCodec: A Speech Representation Codec for Speech Tokenization | 5566, rejected; accepted by ACL 2024 main: https://arxiv.org/pdf/2309.00169 | 5.50 |
| A Discrete and Variational Approach to Speech Representation Learning | 33588, withdrawn | 5.40 |
| Generative Pre-Trained Speech Language Model with Efficient Hierarchical Transformer | 5556, rejected; accepted by ACL 2024: https://arxiv.org/pdf/2406.00976 | 5.25 |

### Audio

This list includes audio papers whose review ratings are good or middling (average usually above 5), whether or not they were accepted.

There are 20+ audio papers at ICLR'24 in total; we select 17 of them.

| Paper | Status | Average rate |
|---|---|---|
| Masked Audio Generation using a Single Non-Autoregressive Transformer | | 7.33 |
| Listen, Think, and Understand | | 7.00 |
| Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation | | 6.67 |
| Weakly-supervised Audio Separation via Bi-modal Semantic Similarity | | 6.67 |
| CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models | | 6.50 |
| Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis | | 6.00 |
| Lifelong Audio-video Masked Autoencoder with Forget-robust Localized Alignments | 55558, rejected | 5.60 |
| LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT | 5566, rejected | 5.50 |
| SoundStorm: Efficient Parallel Audio Generation | 35568, rejected | 5.40 |
| Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues | 3666, rejected | 5.25 |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | 3666, rejected | 5.25 |
| UniAudio: An Audio Foundation Model Toward Universal Audio Generation | 15510, rejected; accepted by ICML 2024 | 5.25 |
| Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners | 3666, rejected | 5.25 |
| SMILE: Audio-Visual Speech Recognition with Siamese Masked Interaction Learning | 5555, rejected | 5.00 |
| Leveraging characteristics of the output distribution for identifying adversarial audio examples | 5555, rejected | 5.00 |
| Rethinking Audiovisual Segmentation with Semantic Quantization and Decomposition | 5555, rejected | 5.00 |
| WavJourney: Compositional Audio Creation with Large Language Models | 35566, rejected | 5.00 |

### Summary

This year, the number of papers is not very large.

## ICML'24

### Speech

| Paper | Status |
|---|---|
| ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis | link |
| video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models | link |
| NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models | link |
| InstructSpeech: Following Speech Editing Instructions via Large Language Models | link |
| Scaling Speech Technology to 1,000+ Languages | link |
| IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation | link |
| Speech Self-Supervised Learning Using Diffusion Model Synthetic Data | link |

### Audio

| Paper | Status |
|---|---|
| Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion | link |
| UniAudio: Towards Universal Audio Generation with Large Language Models | link |
| Prompt-guided Precise Audio Editing with Diffusion Models | |
| Creative Text-to-Audio Generation via Synthesizer Programming | |
| Fast Timing-Conditioned Latent Audio Diffusion | |
| Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | |
| Listenable Maps for Audio Classifiers | |
| STELLA: Continual Audio-Video Pre-training with SpatioTemporal Localized Alignment | |
| From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation | |
| AND: Audio Network Dissection for Interpreting Deep Acoustic Models | |
| EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning | |

## NeurIPS'24

### Speech

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=speech
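
The lists below come from keyword searches over paper titles, as in the link above. A minimal Python sketch of the same case-insensitive filtering (the helper name and sample titles are illustrative, not from this repository):

```python
# Minimal sketch of the title filtering behind the "useful link" above:
# keep every paper whose title contains the keyword, case-insensitively.

def filter_titles(titles: list[str], keyword: str) -> list[str]:
    kw = keyword.lower()
    return [title for title in titles if kw in title.lower()]

sample_titles = [
    "SSDM: Scalable Speech Dysfluency Modeling",
    "Learning Spatially-Aware Language and Audio Embeddings",
    "How to Scale Your EMA",
]
print(filter_titles(sample_titles, "speech"))
# -> ['SSDM: Scalable Speech Dysfluency Modeling']
```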

- SSDM: Scalable Speech Dysfluency Modeling
- SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
- Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation
- A Full-duplex Speech Dialogue Scheme Based On Large Language Model
- CA-SSLR: Condition-Aware Self-Supervised Learning Representation for Generalized Speech Processing
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models
- DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation
- SILENCE: Protecting privacy in offloaded speech understanding on resource-constrained devices
- FINALLY: fast and universal speech enhancement with studio-like quality
- SpeechAlign: Aligning Speech Generation to Human Preferences
- Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals
- SCOREQ: Speech Quality Assessment with Contrastive Regression
- RealMAN: A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
- CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
- IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
- Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation
- Comprehensive Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for the Polish Language

### Audio

useful link: https://nips.cc/virtual/2024/papers.html?filter=titles&search=audio

- Vocal Call Locator Benchmark (VCL'24) for localizing rodent vocalizations from multi-channel audio
- SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
- Tell What You Hear From What You See - Video to Audio Generation Through Text
- Learning Spatially-Aware Language and Audio Embeddings
- Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes
- SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection
- Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
- Continual Audio-Visual Sound Separation
- Mixtures of Experts for Audio-Visual Learning
- Listenable Maps for Zero-Shot Audio Classifiers
- Aligning Audio-Visual Joint Representations with an Agentic Workflow
- AV-Cloud: Spatial Audio Rendering Through Audio-Visual Cloud Splatting
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
- An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching
- A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs
- VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time
- UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner
- AudioMarkBench: Benchmarking Robustness of Audio Watermarking

## ICML'23

### Speech

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=speech

| Paper | Status |
|---|---|
| Pre-training for Speech Translation: CTC Meets Optimal Transport | Oral |
| Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language | Oral |
| Robust Speech Recognition via Large-Scale Weak Supervision | |
| Shiftable Context: Addressing Training-Inference Context Mismatch in Simultaneous Speech Translation | |
| Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations | |
| MetricGAN-OKD: Multi-Metric Optimization of MetricGAN via Online Knowledge Distillation for Speech Enhancement | |
| Mu$^2$SLAM: Multitask, Multilingual Speech and Language Models | |

### Audio

useful link: https://icml.cc/virtual/2023/papers.html?filter=titles&search=audio

| Paper | Status |
|---|---|
| AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | |
| Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | |
| A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition | |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Oral |
| Do You Remember? Overcoming Catastrophic Forgetting for Fake Audio Detection | |

## NeurIPS'23

### Speech

| Paper | Status |
|---|---|
| High-Fidelity Audio Compression with Improved RVQGAN | Spotlight |
| Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio | Spotlight |
| How to Scale Your EMA | Spotlight |
| Textually Pretrained Speech Language Models | |
| ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation | |
| DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation | |
| StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models | |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | |
| DOSE: Diffusion Dropout with Adaptive Prior for Speech Enhancement | |
| P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting | |
| DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning | |
| Parts of Speech–Grounded Subspaces in Vision-Language Models | |
| UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures | |
| Learning Repeatable Speech Embeddings Using An Intra-class Correlation Regularizer | |
| Disentangling Voice and Content with Self-Supervision for Speaker Recognition | |
| From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion | |
| Unified Segment-to-Segment Framework for Simultaneous Sequence Generation | |
| Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference | |
| Progressive Ensemble Distillation: Building Ensembles for Efficient Inference | |
| LEACE: Perfect linear concept erasure in closed form | |
| TART: A plug-and-play Transformer module for task-agnostic reasoning | |

### Audio

| Paper | Status |
|---|---|
| Compression with Bayesian Implicit Neural Representations | Spotlight |
| From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion | |
| Pengi: An Audio Language Model for Audio Tasks | |
| AUDIT: Audio Editing by Following Instructions with Latent Diffusion Models | |
| MAViL: Masked Audio-Video Learners | |
| Weakly-Supervised Audio-Visual Segmentation | |
| Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning | |
| Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models | |
| AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis | |
| Simple and Controllable Music Generation | |
| CoLLAT: On Adding Fine-grained Audio Understanding to Language Models using Token-Level Locked-Language Tuning | |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| Modality-Independent Teachers Meet Weakly-Supervised Audio-Visual Event Parser | |
| Dual Mean-Teacher: An Unbiased Semi-Supervised Framework for Audio-Visual Source Localization | |
| Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual Downstream Tasks | |
| Revisit Weakly-Supervised Audio-Visual Video Parsing from the Language Perspective | |
| Self-Supervised Visual Acoustic Matching | |
| Connecting Multi-modal Contrastive Representations | |
| Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale | |
| Achieving Cross Modal Generalization with Multimodal Unified Representation | |
| Any-to-Any Generation via Composable Diffusion | |
| Efficient Neural Music Generation | |
| Training Transitive and Commutative Multimodal Transformers with LoReTTa | |
| Latent Diffusion for Language Generation | |
| Block-State Transformers | |
| Learning Interpretable Low-dimensional Representation via Physical Symmetry | |
| Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception | |
| Feature Dropout: Revisiting the Role of Augmentations in Contrastive Learning | |
| Language Semantic Graph Guided Data-Efficient Learning | |

## ACMMM'24

### Speech

| Paper | Status |
|---|---|
| VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling | Oral |
| UniStyle: Unified Style Modeling for Speaking Style Captioning and Stylistic Speech Synthesis | Oral |
| Boosting Speech Recognition Robustness to Modality-Distortion with Contrast-Augmented Prompts | Oral |
| ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations | Oral |
| Self-Supervised Emotion Representation Disentanglement for Speech-Preserving Facial Expression Manipulation | Oral |
| Generative Expressive Conversational Speech Synthesis | |
| SpeechCraft: A Fine-Grained Expressive Speech Dataset with Natural Language Description | |
| CIEASR: Contextual Image-Enhanced Automatic Speech Recognition for Improved Homophone Discrimination | |
| EGGesture: Entropy-Guided Vector Quantized Variational AutoEncoder for Co-Speech Gesture Generation | |
| DEITalk: Speech-Driven 3D Facial Animation with Dynamic Emotional Intensity Modeling | |
| Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis | |
| RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues | |
| Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation | |
| Speech Reconstruction from Silent Lip and Tongue Articulation by Diffusion Models and Text-Guided Pseudo Target Generation | |
| MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation | |
| SpeechEE: A Novel Benchmark for Speech Event Extraction | |
| MambaGesture: Enhancing Co-Speech Gesture Generation with Mamba and Disentangled Multi-Modality Fusion | |
| Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation | |
| Enabling Synergistic Full-Body Control in Prompt-Based Co-Speech Motion Generation | |
| FlashSpeech: Efficient Zero-Shot Speech Synthesis | |

### Audio

| Paper | Status |
|---|---|
| OpenAVE: Moving towards Open Set Audio-Visual Event Localization | Oral |
| Unveiling and Mitigating Bias in Audio Visual Segmentation | Oral |
| AV-Deepfake1M: A Large-Scale LLM-Driven Audio-Visual Deepfake Dataset | Oral |
| Tango 2: Aligning Diffusion-based Text-to-Audio Generative Models through Direct Preference Optimization | Oral |
| Towards Trustworthy MetaShopping: Studying Manipulative Audiovisual Designs in Virtual-Physical Commercial Platforms | Oral |
| Open-Vocabulary Audio-Visual Semantic Segmentation | Oral |
| Advancing Multi-grained Alignment for Contrastive Language-Audio Pre-training | Oral |
| Toward Explainable Physical Audiovisual Commonsense Reasoning | Oral |
| TiVA: Time-Aligned Video-to-Audio Generation | Oral |
| Coarse-to-Fine Proposal Refinement Framework For Audio Temporal Forgery Detection and Localization | Oral |
| SelM: Selective Mechanism based Audio-Visual Segmentation | Oral |
| Dissecting Temporal Understanding in Text-to-Audio Retrieval | |
| FRADE: Forgery-aware Audio-distilled Multimodal Learning for Deepfake Detection | |
| AMG-Embedding: a Self-Supervised Embedding Approach for Audio Identification | |
| MMAL: Multi-Modal Analytic Learning for Exemplar-Free Audio-Visual Class Incremental Tasks | |
| Utilizing Speaker Profiles for Impersonation Audio Detection | |
| CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization | |
| CoPL: Parameter-Efficient Collaborative Prompt Learning for Audio-Visual Tasks | |
| Time-Frequency Domain Fusion Enhancement for Audio Super-Resolution | |
| Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning | |
| Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition | |
| Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier | |
| AVHash: Joint Audio-Visual Hashing for Video Retrieval | |
| RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues | |
| EchoAudio: Efficient and High-Quality Text-to-Audio Generation with Minimal Inference Steps | |
| Instance-Level Panoramic Audio-Visual Saliency Detection and Ranking | |
| Audio-Driven Identity Manipulation for Face Inpainting | |
| GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis | |
| TAS: Personalized Text-guided Audio Spatialization | |
| Boosting Audio Visual Question Answering via Key Semantic-Aware Cues | |
| V2A-Mark: Versatile Deep Visual-Audio Watermarking for Manipulation Localization and Copyright Protection | |

## ICLR'23

### Speech

- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation
- An efficient encoder-decoder architecture with top-down attention for speech separation
- Jointly Learning Visual and Auditory Speech Representations from Raw Data
- Bag of Tricks for Unsupervised Text-to-Speech
- In-Situ Text-Only Adaptation of Speech Models with Low-Overhead Speech Imputations
- Revisiting the Entropy Semiring for Neural Speech Recognition
- D4AM: A General Denoising Framework for Downstream Acoustic Models
- Filter-Recovery Network for Multi-Speaker Audio-Visual Speech Separation
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training
- Continuous pseudo-labeling from the start
- NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis
- Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks

### Audio

| Paper | Status |
|---|---|
| Token Merging: Your ViT But Faster | Oral |
| Contrastive Audio-Visual Masked Autoencoder | Spotlight |
| AudioGen: Textually Guided Audio Generation | |
| Defending against Adversarial Audio via Diffusion Model | |
| wav2tok: Deep Sequence Tokenizer for Audio Retrieval | |
| Continual Transformers: Redundancy-Free Attention for Online Inference | |
| GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis | |
| Words are all you need? Language as an approximation for human similarity judgments | |

## AAAI'24

useful links: https://aaai.org/wp-content/uploads/2024/02/AAAI-24_Main_2024-02-01.pdf

https://github.com/DmitryRyumin/AAAI-2024-Papers

### Speech

| Paper | Link |
|---|---|
| Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation | https://arxiv.org/abs/2312.10877 |
| UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding | https://arxiv.org/abs/2306.07547 |
| Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation | https://arxiv.org/abs/2401.03468 |
| Visual Hallucination Elevates Speech Recognition | https://ojs.aaai.org/index.php/AAAI/article/view/29926 |
| Spanning the Spectrum of Hatred Detection: A Persian Multi-Label Hate Speech Dataset with Annotator Rationales | https://ojs.aaai.org/index.php/AAAI/article/view/29743 |
| Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition | https://ojs.aaai.org/index.php/AAAI/article/view/29882 |
| MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis | https://arxiv.org/abs/2312.10687 |
| Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling | https://arxiv.org/abs/2312.11947 |
| Let There Be Sound: Reconstructing High Quality Speech from Silent Videos | https://arxiv.org/abs/2308.15256 |
| Divergence-Guided Simultaneous Speech Translation | https://ojs.aaai.org/index.php/AAAI/article/view/29733 |
| SECap: Speech Emotion Captioning with Large Language Model | https://arxiv.org/abs/2312.10381 |
| Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction | https://arxiv.org/abs/2312.10305 |

### Audio

| Paper | Link |
|---|---|
| AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis | https://arxiv.org/abs/2312.10921 |
| V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models | https://arxiv.org/abs/2308.09300 |
| What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection | https://arxiv.org/abs/2312.09651 |
| Audio Generation with Multiple Conditional Diffusion Model | https://arxiv.org/abs/2308.11940 |
| AVSegFormer: Audio-Visual Segmentation with Transformer | https://ojs.aaai.org/index.php/AAAI/article/view/29104 |
| Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation | https://arxiv.org/abs/2309.16429 |
| Sample-Constrained Black Box Optimization for Audio Personalization | https://ojs.aaai.org/index.php/AAAI/article/view/28881 |
| DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification | https://ojs.aaai.org/index.php/AAAI/article/view/29716 |
| CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments | https://arxiv.org/abs/2306.04047 |
| Learning Temporal Resolution in Spectrogram for Audio Classification | https://arxiv.org/abs/2210.01719 |
| SoundCount: Sound Counting from Raw Audio with Dyadic Decomposition Neural Network | https://arxiv.org/abs/2312.16149 |
| Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation | https://arxiv.org/abs/2312.08673 |
| Improving Audio-Visual Segmentation with Bidirectional Generation | https://arxiv.org/abs/2308.08288 |
| Audio Scanning Network: Bridging Time and Frequency Domains for Audio Classification | https://ojs.aaai.org/index.php/AAAI/article/view/29015 |
| Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering | https://arxiv.org/abs/2312.12816 |
| Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer | https://arxiv.org/abs/2309.07929 |

## ACL'24

useful link: https://2024.aclweb.org/program/main_conference_papers/#long-papers

### Speech

| Paper | Authors | Status |
|---|---|---|
| GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators | Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Ruizhe Li, Dong Zhang, Zhehuai Chen, EngSiong Chng | Long, link |
| Wav2Gloss: Generating Interlinear Glossed Text from Speech | Taiqi He, Kwanghee Choi, Lindia Tjuatja, Nathaniel Romney Robinson, Jiatong Shi, Shinji Watanabe, Graham Neubig, David R Mortensen, Lori Levin | https://aclanthology.org/2024.acl-long.34.pdf |
| A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Any Translation | Zhengrui Ma, Qingkai Fang, Shaolei Zhang, Shoutao Guo, Yang Feng, Min zhang | https://aclanthology.org/2024.acl-long.85.pdf |
| Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer | Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu | https://aclanthology.org/2024.acl-long.97.pdf |
| Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing? | Marco Gaido, Sara Papi, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.789.pdf |
| StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection | Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.202.pdf |
| Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? | Roshan Sharma, Suwon Shon, Mark Lindsey, Hira Dhamyal, Bhiksha Raj | https://aclanthology.org/2024.acl-long.790.pdf |
| LLM Knows Body Language, Too: Translating Speech Voices into Human Gestures | Chenghao Xu, Guangtao Lyu, Jiexi Yan, Muli Yang, Cheng Deng | https://aclanthology.org/2024.acl-long.273.pdf |
| RepCodec: A Speech Representation Codec for Speech Tokenization | Zhichao Huang, Chutong Meng, Tom Ko | https://aclanthology.org/2024.acl-long.314.pdf |
| Error-preserving Automatic Speech Recognition of Young English Learners’ Language | Janick Michot, Manuela Hürlimann, Jan Milan Deriu, Luzia Sauer, Katsiaryna Mlynchyk, Mark Cieliebak | https://aclanthology.org/2024.acl-long.348.pdf |
| Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? | Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min zhang, Yang Feng | https://aclanthology.org/2024.acl-long.392.pdf |
| Multimodal Contextualized Semantic Parsing from Speech | Jordan Voas, David Harwath, Ray Mooney | https://aclanthology.org/2024.acl-long.398.pdf |
| SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network | Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo XU, Guoqi Li | https://aclanthology.org/2024.acl-long.429.pdf |
| Speech Sense Disambiguation: Tackling Homophone Ambiguity in End-to-End Speech Translation | Tengfei Yu, Xuebo Liu, Liang Ding, Kehai Chen, Dacheng Tao, Min Zhang | https://aclanthology.org/2024.acl-long.435.pdf |
| Label-Synchronous Neural Transducer for E2E Simultaneous Speech Translation | Keqi Deng, Phil Woodland | https://aclanthology.org/2024.acl-long.448.pdf |
| Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t | Chihiro Taguchi, David Chiang | https://aclanthology.org/2024.acl-long.827.pdf |
| Speech language models lack important brain-relevant semantics | SUBBA REDDY OOTA, Emin Çelik, Fatma Deniz, Mariya Toneva | https://aclanthology.org/2024.acl-long.462.pdf |
| StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning | Shaolei Zhang, Qingkai Fang, Shoutao Guo, Zhengrui Ma, Min zhang, Yang Feng | https://aclanthology.org/2024.acl-long.485.pdf |
| NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data | Manuel Tonneau, Pedro Vitor Quinta de Castro, Karim Lasri, Ibrahim Sambo Farouq, Lakshmi Subramanian, Victor Orozco-Olvera, Samuel Fraiberger | https://aclanthology.org/2024.acl-long.488v2.pdf |
| Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation | Songju Lei, Xize Cheng, Mengjiao Lyu, Jianqiao Hu, Jintao Tan, Runlin Liu, Lingyu Xiong, Tao Jin, Xiandong Li, Zhou Zhao | https://aclanthology.org/2024.acl-long.543.pdf |
| OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification | Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe | https://aclanthology.org/2024.acl-long.549.pdf |
| Don’t Go To Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection | Min Zhang, Jianfeng He, Taoran Ji, Chang-Tien Lu | https://aclanthology.org/2024.acl-long.652.pdf |
| Structured Tree Alignment for Evaluation of (Speech) Constituency Parsing | Freda Shi, Kevin Gimpel, Karen Livescu | https://aclanthology.org/2024.acl-long.666.pdf |
| VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild | Puyuan Peng, Po-Yao Huang, Shang-Wen Li, Abdelrahman Mohamed, David Harwath | https://aclanthology.org/2024.acl-long.673.pdf |
| A Community-Centric Perspective for Characterizing and Detecting Anti-Asian Violence-Provoking Speech | Gaurav Verma, Rynaa Grover, Jiawei Zhou, Binny Mathew, Jordan Kraemer, Munmun De Choudhury, Srijan Kumar | https://aclanthology.org/2024.acl-long.684.pdf |
| XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang | https://aclanthology.org/2024.acl-long.697.pdf |
| MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech | Shengpeng Ji, Ziyue Jiang, Wang Hanting, Jialung Zuo, Zhou Zhao | https://aclanthology.org/2024.acl-long.733.pdf |
| The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition | Enshi Zhang, Rafael Trujillo, Christian Poellabauer | https://aclanthology.org/2024.acl-long.752.pdf |
| Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech | Adrien Pupier, Maximin Coavoux, Jérôme Goulian, Benjamin Lecouteux | Short, link |
| Explainability and Hate Speech: Structured Explanations Make Social Media Moderators Faster | Agostina Calabrese, Leonardo Neves, Neil Shah, Maarten W. Bos, Björn Ross, Mirella Lapata, Francesco Barbieri | https://aclanthology.org/2024.acl-short.38.pdf |
| On the Semantic Latent Space of Diffusion-Based Text-To-Speech Models | Miri Varshavsky, Roy Hirsch, Regev Cohen, Tomer Golany, Daniel Freedman, Ehud Rivlin | https://aclanthology.org/2024.acl-short.24.pdf |

### Audio

| Paper | Authors | Status |
|---|---|---|
| AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension | Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, Jingren Zhou | Long, link |
| StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection | Sara Papi, Marco Gaido, Matteo Negri, Luisa Bentivogli | https://aclanthology.org/2024.acl-long.202.pdf |
| M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset | Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang | https://aclanthology.org/2024.acl-long.489.pdf |
| XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception | HyoJung Han, Mohamed Anwar, Juan Pino, Wei-Ning Hsu, Marine Carpuat, Bowen Shi, Changhan Wang | https://aclanthology.org/2024.acl-long.697.pdf |

## EMNLP'24

useful links: https://2024.emnlp.org/program/accepted_main_conference/

https://2024.emnlp.org/program/accepted_findings/

### Speech

| Paper | Authors | Status |
|---|---|---|
| When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection | Xiangyu Zhang, Hexin Liu, Kaishuai Xu, Qiquan Zhang, Daijiao Liu, Beena Ahmed, Julien Epps | Main, link |
| Speaking in Wavelet Domain: A Simple and Efficient Approach to Speed up Speech Diffusion Model | Xiangyu Zhang, Daijiao Liu, Hexin Liu, Qiquan Zhang, Hanyu Meng, Leibny Paola Garcia Perera, EngSiong Chng, Lina Yao | https://aclanthology.org/2024.emnlp-main.9.pdf |
| Scaling Properties of Speech Language Models | Santiago Cuervo, Ricard Marxer | https://aclanthology.org/2024.emnlp-main.21.pdf |
| EmphAssess: a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models | Maureen de Seyssel, Antony D’Avirro, Adina Williams, Emmanuel Dupoux | https://aclanthology.org/2024.emnlp-main.30.pdf |
| Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering | Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini | https://aclanthology.org/2024.emnlp-main.201.pdf |
| AlignCap: Aligning Speech Emotion Captioning to Human Preferences | Ziqi Liang, Haoxiang Shi, Hanhui Chen | https://aclanthology.org/2024.emnlp-main.224.pdf |
| F$^2$RL: Factuality and Faithfulness Reinforcement Learning Framework for Claim-Guided Evidence-Supported Counterspeech Generation | Haiyang Wang, Yuchen Pan, Xin Song, Xuechen Zhao, Minghao Hu, Bin Zhou | https://aclanthology.org/2024.emnlp-main.255.pdf |
| Outcome-Constrained Large Language Models for Countering Hate Speech | Lingzi Hong, Pengcheng Luo, Eduardo Blanco, Xiaoying Song | https://aclanthology.org/2024.emnlp-main.260.pdf |
| On Mitigating Performance Disparities in Multilingual Speech Recognition | Monorama Swain, Anna Katrine van Zee, Anders Søgaard | https://aclanthology.org/2024.emnlp-main.323.pdf |
| Methods of Automatic Matrix Language Determination for Code-Switched Speech | Olga Iakovenko, Thomas Hain | https://aclanthology.org/2024.emnlp-main.330.pdf |
| EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning | Ashish Seth, Ramaneswaran S, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha | https://aclanthology.org/2024.emnlp-main.366.pdf |
| Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models | Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales | https://aclanthology.org/2024.emnlp-main.430.pdf |
| Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning | Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee | https://aclanthology.org/2024.emnlp-main.445.pdf |
| Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition | Hsuan Su, Hua Farn, Fan-Yun Sun, Shang-Tse Chen, Hung-yi Lee | https://aclanthology.org/2024.emnlp-main.503.pdf |
| ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers | Yuzhe Gu, Enmao Diao | https://aclanthology.org/2024.emnlp-main.562.pdf |
| Towards Robust Speech Representation Learning for Thousands of Languages | William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe | https://aclanthology.org/2024.emnlp-main.570.pdf |
| Speechworthy Instruction-tuned Language Models | Hyundong Justin Cho, Nicolaas Paul Jedema, Leonardo F. R. Ribeiro, Karishma Sharma, Pedro Szekely, Alessandro Moschitti, Ruben Janssen, Jonathan May | https://aclanthology.org/2024.emnlp-main.595.pdf |
| Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights | Hao Yang, Lizhen Qu, Ehsan Shareghi, Reza Haf | https://aclanthology.org/2024.emnlp-main.614.pdf |
| Integrating Argumentation and Hate-Speech-based Techniques for Countering Misinformation | Sougata Saha, Rohini Srihari | https://aclanthology.org/2024.emnlp-main.622.pdf |
| Unveiling the Role of Pretraining in Direct Speech Translation | Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà | https://aclanthology.org/2024.emnlp-main.630.pdf |
| Multi-Level Cross-Modal Alignment for Speech Relation Extraction | Liang Zhang, Zhen Yang, Biao Fu, Ziyao Lu, Liangying Shao, Shiyu Liu, Fandong Meng, Jie Zhou, Xiaoli Wang, Jinsong Su | https://aclanthology.org/2024.emnlp-main.668.pdf |
| Self-Powered LLM Modality Expansion for Large Speech-Text Models | Tengfei Yu, Xuebo Liu, Zhiyi Hou, Liang Ding, Dacheng Tao, Min Zhang | https://aclanthology.org/2024.emnlp-main.690.pdf |
| Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach | Siqi Li, Danni Liu, Jan Niehues | https://aclanthology.org/2024.emnlp-main.708.pdf |
| Towards an Open-Source Speech Foundation Model for EU: 950,000 Hours of Open-Source Compliant Speech Data for EU Languages | Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri | https://aclanthology.org/2024.emnlp-main.771.pdf |
| VHASR: A Multimodal Speech Recognition System With Vision Hotwords | Jiliang Hu, Zuchao Li, Ping Wang, Haojun Ai, Lefei Zhang, hai zhao | https://aclanthology.org/2024.emnlp-main.821.pdf |
| AudioVSR: Enhancing Video Speech Recognition with Audio Data | Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin | https://aclanthology.org/2024.emnlp-main.858.pdf |
| Hate Personified: Investigating the role of LLMs in content moderation pipeline for hate speech | Sarah Masud, Sahajpreet Singh, Viktor Hangya, Alexander Fraser, Tanmoy Chakraborty | https://aclanthology.org/2024.emnlp-main.886.pdf |
| Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification | Esra Dönmez, Thang Vu, Agnieszka Falenska | https://aclanthology.org/2024.emnlp-main.1019.pdf |
| BLSP-Emo: Towards Empathetic Large Speech-Language Models | Chen Wang, Minpeng Liao, Zhongqiang Huang, Junhong Wu, Chengqing Zong, Jiajun Zhang | https://aclanthology.org/2024.emnlp-main.1070.pdf |
| Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection | Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, Sara Tonelli | |
| Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech | Guan-Ting Lin, Wei Ping Huang, Hung-yi Lee | |
| Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding | YeonJoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, seung-won hwang | |
| Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities | Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang | |
| PREDICT: Multi-Agent-based Debate Simulation for Generalized Hate Speech Detection | Someen Park, Jaehoon Kim, Seungwan Jin, Sohyun Park, Kyungsik Han | |
| TokenVerse: Unifying Speech and NLP Tasks via Transducer-based ASR | Shashi Kumar, Srikanth Madikeri, Juan Pablo Zuluaga Gomez, Iuliia Thorbecke, Esaú VILLATORO-TELLO, Sergio Burdisso, Petr Motlicek, Karthik Pandia D S, Aravind Ganapathiraju | |
| Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps | Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, Dirk Hovy | |
| Casablanca: Data and Models for Multidialectal Arabic Speech Recognition | Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, et al. | |
| SpeechQE: Estimating the Quality of Direct Speech Translation | HyoJung Han, Kevin Duh, Marine Carpuat | |
| Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model | Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe | |
| Is Child-Directed Speech Effective Training Data for Language Models? | Steven Y. Feng, Noah Goodman, Michael Frank | |
| HateCOT: An Explanation-Enhanced Dataset for Generalizable Offensive Speech Detection via Large Language Models | Huy Nghiem, Hal Daumé III | Findings |
| PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition | Karima Kadaoui, Maryam Al Ali, Hawau Olamide Toyin, Ibrahim Mohammed, Hanan Aldarmaki | |
| STTATTS: Unified Speech-To-Text And Text-To-Speech Model | Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki | |
| Contextualized Graph Representations for Generating Counter-Narrative against Hate Speech | Selene Baez Santamaria, Helena Gomez Adorno, Ilia Markov | |
| LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models | Zuhair hasan shaik, Pradyoth Hegde, Prashant Bannulmath, Deepak K T | |
| MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech | Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo | |
| Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing | Jeonghun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro | |
| Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation | G M Shahariar, Jia Chen, Jiachen Li, Yue Dong | |
| Breaking the Boundaries: A Unified Framework for Chinese Named Entity Recognition Across Text and Speech | Jinzhong Ning, Yuanyuan Sun, Bo Xu, Zhihao Yang, Ling Luo, Hongfei Lin | |
| Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim, Yejin Jeon, Gary Lee | |
| Modeling Gender and Dialect Bias in Automatic Speech Recognition | Camille Harris, Chijioke Mgbahurike, Neha Kumar, Diyi Yang | |
| LLM generated responses to mitigate the impact of hate speech | Jakub Podolak, Szymon Łukasik, Paweł Balawender, Jan Ossowski, Jan Piotrowski, Katarzyna Bąkowicz, Piotr Sankowski | |
| BLASER 2.0: a metric for evaluation and quality estimation of massively multilingual speech and text translation | David Dale, Marta R. Costa-jussà | |
| Textless Speech-to-Speech Translation With Limited Parallel Data | Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi | |
| PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems | Kentaro Mitsui, Koh Mitsuda, Toshiaki Wakatsuki, Yukiya Hono, Kei Sawada | |
| Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS | Onkar Kishor Susladkar, Vishesh Tripathi, Biddwan Ahmed | |
| Recent Advances in Online Hate Speech Moderation: Multimodality and the Role of Large Models | Ming Shan Hee, Shivam Sharma, RUI CAO, Palash Nandi, Preslav Nakov, Tanmoy Chakraborty, Roy Ka-Wei Lee | |
| WavLLM: Towards Robust and Adaptive Speech Large Language Model | Shujie HU, Long Zhou, Shujie LIU, Sanyuan Chen, Lingwei Meng, Hongkun Hao, Jing Pan, Xunying Liu, Jinyu Li, Sunit Sivasankaran, Linquan Liu, Furu Wei | |

### Audio

| Paper | Authors | Status |
|---|---|---|
| IDEAW: Robust Neural Audio Watermarking with Invertible Dual-Embedding | Pengcheng Li, Xulong Zhang, Jing Xiao, Jianzong Wang | Main |
| Cross-Domain Audio Deepfake Detection: Dataset and Analysis | Yuang Li, Min Zhang, Mengxin Ren, Xiaosong Qiao, Miaomiao Ma, Daimeng Wei, Hao Yang | |
| GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities | Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha | |
| OpenSep: Leveraging Large Language Models with Textual Inversion for Open World Audio Separation | Tanvir Mahmud, Diana Marculescu | |
| AudioVSR: Enhancing Video Speech Recognition with Audio Data | Xiaoda Yang, Xize Cheng, Jiaqi Duan, Hongshun Qiu, Minjie Hong, Minghui Fang, Shengpeng Ji, Jialong Zuo, Zhiqing Hong, Zhimeng Zhang, Tao Jin | |
| PALM: Few-Shot Prompt Learning for Audio Language Models | Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki | |
| Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models | Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D’Haro, Robby T. Tan, Haizhou Li | Findings |
| AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding | Alessandro Suglia, Claudio Greco, Katie Baker, Jose L. Part, Ioannis Papaioannou, Arash Eshghi, Ioannis Konstas, Oliver Lemon | |
| Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Review | Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, Aman Chadha | |
| Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech | Youngjae Kim, Yejin Jeon, Gary Lee | |
| SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering | Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang | |
| PyramidCodec: Hierarchical Codec for Long-form Music Generation in Audio Domain | Jianyi Chen, Zheqi DAI, Zhen Ye, Xu Tan, Qifeng Liu, Yike Guo, Wei Xue | |

## Useful Survey & Awesome Links

  1. Amphion v0.2 technical report: https://arxiv.org/abs/2501.15442

  2. Emilia-Large: a larger release, with more experimental results and details: https://arxiv.org/abs/2501.15907

  3. AnyEnhance: one model that handles speech enhancement, singing voice enhancement, speaker extraction, and more: https://arxiv.org/abs/2501.15417
