| 2024-09-16 | An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems | Hitesh Tulsiani et.al. | 2409.10515v1 | null |
| 2024-09-16 | MusicLIME: Explainable Multimodal Music Understanding | Theodoros Sotirou et.al. | 2409.10496v1 | link |
| 2024-09-16 | Meta-Whisper: Speech-Based Meta-ICL for ASR on Low-Resource Languages | Ming-Hao Hsu et.al. | 2409.10429v1 | null |
| 2024-09-16 | Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement | Wenze Ren et.al. | 2409.10376v1 | null |
| 2024-09-16 | Ultra-Low Latency Speech Enhancement - A Comprehensive Study | Haibin Wu et.al. | 2409.10358v1 | null |
| 2024-09-16 | 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation? | Téo Guichoux et.al. | 2409.10357v1 | null |
| 2024-09-16 | DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis | Fa-Ting Hong et.al. | 2409.10281v1 | null |
| 2024-09-16 | RoboVox Far Field Speaker Recognition: A Novel Data Augmentation Approach with Pretrained Models | Muhammad Sudipto Siam Dip et.al. | 2409.10240v1 | null |
| 2024-09-16 | Speech as a Biomarker for Disease Detection | Catarina Botelho et.al. | 2409.10230v1 | null |
| 2024-09-16 | RF-GML: Reference-Free Generative Machine Listener | Arijit Biswas et.al. | 2409.10210v1 | null |
| 2024-09-16 | Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization | Xiaoxue Gao et.al. | 2409.10157v1 | null |
| 2024-09-16 | Room impulse response prototyping using receiver distance estimations for high quality room equalisation algorithms | James Brooks-Park et.al. | 2409.10131v1 | null |
| 2024-09-16 | Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT | Ryota Komatsu et.al. | 2409.10103v1 | link |
| 2024-09-16 | Optimizing Dysarthria Wake-Up Word Spotting: An End-to-End Approach for SLT 2024 LRDWWS Challenge | Shuiyun Liu et.al. | 2409.10076v1 | null |
| 2024-09-16 | Speaker Contrastive Learning for Source Speaker Tracing | Qing Wang et.al. | 2409.10072v1 | null |
| 2024-09-16 | StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion | Yinghao Aaron Li et.al. | 2409.10058v1 | null |
| 2024-09-16 | TBDM-Net: Bidirectional Dense Networks with Gender Information for Speech Emotion Recognition | Vlad Striletchi et.al. | 2409.10056v1 | null |
| 2024-09-16 | Audio-Driven Reinforcement Learning for Head-Orientation in Naturalistic Environments | Wessel Ledder et.al. | 2409.10048v1 | null |
| 2024-09-16 | DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval | Yifei Xin et.al. | 2409.10025v1 | null |
| 2024-09-16 | DNN-based ensemble singing voice synthesis with interactions between singers | Hiroaki Hyodo et.al. | 2409.09988v1 | null |
| 2024-09-16 | A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models | Ryandhimas E. Zezario et.al. | 2409.09914v1 | null |
| 2024-09-15 | Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning | Siqi Sun et.al. | 2409.09891v1 | null |
| 2024-09-15 | Constructing a Singing Style Caption Dataset | Hyunjong Ok et.al. | 2409.09866v1 | link |
| 2024-09-15 | Efficient Video to Audio Mapper with Visual Scene Detection | Mingjing Yi et.al. | 2409.09823v1 | null |
| 2024-09-15 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition | Chao-Han Huck Yang et.al. | 2409.09785v2 | null |
| 2024-09-15 | Self-supervised Multimodal Speech Representations for the Assessment of Schizophrenia Symptoms | Gowtham Premananth et.al. | 2409.09733v1 | null |
| 2024-09-15 | A Comprehensive Methodological Survey of Human Activity Recognition Across Divers Data Modalities | Jungpil Shin et.al. | 2409.09678v1 | null |
| 2024-09-15 | Self-supervised Learning for Acoustic Few-Shot Classification | Jingyong Liang et.al. | 2409.09647v1 | null |
| 2024-09-15 | Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement | Yudong Yang et.al. | 2409.09642v1 | null |
| 2024-09-15 | Stutter-Solver: End-to-end Multi-lingual Dysfluency Detection | Xuanru Zhou et.al. | 2409.09621v1 | link |