Audio Understanding

Publish Date	Title	Authors	PDF	Code
2025-06-26	GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation	Wentao Hu et.al.	2506.21513v1	null
2025-06-26	SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture	Kehan Sui et.al.	2506.21478v1	null
2025-06-26	Aligning Spoken Dialogue Models from User Interactions	Anne Wu et.al.	2506.21463v1	null
2025-06-26	ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing	Huadai Liu et.al.	2506.21448v1	null
2025-06-26	Learnable Adaptive Time-Frequency Representation via Differentiable Short-Time Fourier Transform	Maxime Leiber et.al.	2506.21440v1	null
2025-06-26	Deception Detection in Dyadic Exchanges Using Multimodal Machine Learning: A Study on a Swedish Cohort	Franco Rugolon et.al.	2506.21429v1	null
2025-06-26	Hybrid Deep Learning and Signal Processing for Arabic Dialect Recognition in Low-Resource Settings	Ghazal Al-Shwayyat et.al.	2506.21386v1	null
2025-06-26	Exploring Adapter Design Tradeoffs for Low Resource Music Generation	Atharva Mehta et.al.	2506.21298v1	null
2025-06-26	Integrating Vehicle Acoustic Data for Enhanced Urban Traffic Management: A Study on Speed Classification in Suzhou	Pengfei Fan et.al.	2506.21269v1	null
2025-06-26	Prompt-Guided Turn-Taking Prediction	Koji Inoue et.al.	2506.21191v1	null
2025-06-26	Performance improvement of spatial semantic segmentation with enriched audio features and agent-based error correction for DCASE 2025 Challenge Task 4	Jongyeon Park et.al.	2506.21174v1	null
2025-06-26	A Hierarchical Deep Learning Approach for Minority Instrument Detection	Dylan Sechet et.al.	2506.21167v1	null
2025-06-26	Post-training for Deepfake Speech Detection	Wanying Ge et.al.	2506.21090v1	null
2025-06-26	PeakNetFP: Peak-based Neural Audio Fingerprinting Robust to Extreme Time Stretching	Guillem Cortès-Sebastià et.al.	2506.21086v1	null
2025-06-26	CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate	Hankun Wang et.al.	2506.21074v1	null
2025-06-26	Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance	Akio Hayakawa et.al.	2506.20995v1	null
2025-06-26	OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs	Yiman Zhang et.al.	2506.20960v1	null
2025-06-26	A Multi-Stage Framework for Multimodal Controllable Speech Synthesis	Rui Niu et.al.	2506.20945v1	null
2025-06-25	Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers	Furkan Mumcu et.al.	2506.20816v1	null
2025-06-25	Deciphering GunType Hierarchy through Acoustic Analysis of Gunshot Recordings	Ankit Shah et.al.	2506.20609v1	null
2025-06-25	Multimodal Representation Learning and Fusion	Qihang Jin et.al.	2506.20494v1	null
2025-06-25	The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models	Yi Wang et.al.	2506.20361v1	null
2025-06-25	Feature Hallucination for Self-supervised Action Recognition	Lei Wang et.al.	2506.20342v1	null
2025-06-25	Malicious earworms and useful memes, how the far-right surfs on TikTok audio trends	Marloes Geboers et.al.	2506.20695v1	null
2025-06-25	Lightweight Target-Speaker-Based Overlap Transcription for Practical Streaming ASR	Aleš Pražák et.al.	2506.20288v1	null
2025-06-25	CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment	Papa Séga Wade et.al.	2506.20243v1	null
2025-06-25	An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS	Marie Kunešová et.al.	2506.20190v1	null
2025-06-25	MEL: Multi-level Ensemble Learning for Resource-Constrained Environments	Krishna Praneet Gudipaty et.al.	2506.20094v1	null
2025-06-24	Neuromorphic Wireless Split Computing with Resonate-and-Fire Neurons	Dengyu Wu et.al.	2506.20015v1	null
2025-06-24	Improved Topology-Independent Distributed Adaptive Node-Specific Signal Estimation for Wireless Acoustic Sensor Networks	Paul Didier et.al.	2506.20001v1	null