Paper Digest: Recent Papers on Speech Recognition
The Paper Digest Team extracted all recent Speech Recognition related papers on our radar and generated highlight sentences for them. The results are sorted by relevance and date. In addition to this ‘static’ page, we also provide a real-time version of this article, which has broader coverage and is updated continuously to include the most recent papers on this topic.
TABLE 1: Paper Digest: Recent Papers on Speech Recognition
No. | Paper | Author(s) | Source | Date |
---|---|---|---|---|
1 | OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia. Highlight: In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. | XUELONG GENG et al. | arxiv-cs.SD | 2025-01-22 |
2 | FlanEC: Exploring Flan-T5 for Post-ASR Error Correction. Highlight: In this paper, we present an encoder-decoder model leveraging Flan-T5 for post-Automatic Speech Recognition (ASR) Generative Speech Error Correction (GenSEC), and we refer to it as FlanEC. | Moreno La Quatra; Valerio Mario Salerno; Yu Tsao; Sabato Marco Siniscalchi; | arxiv-cs.CL | 2025-01-22 |
3 | Investigation of Whisper ASR Hallucinations Induced By Non-Speech Audio. Highlight: In this paper, we investigate hallucinations of the Whisper ASR model induced by non-speech audio segments present during inference. | MATEUSZ BARAŃSKI et al. | arxiv-cs.SD | 2025-01-20 |
4 | Delayed Fusion: Integrating Large Language Models Into First-Pass Decoding in End-to-end Speech Recognition. Highlight: This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). | Takaaki Hori; Martin Kocour; Adnan Haider; Erik McDermott; Xiaodan Zhuang; | arxiv-cs.CL | 2025-01-15 |
5 | Selective Attention Merging for Low Resource Tasks: A Case Study of Child ASR. Highlight: While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. | Natarajan Balaji Shankar; Zilai Wang; Eray Eren; Abeer Alwan; | arxiv-cs.CL | 2025-01-14 |
6 | AdaCS: Adaptive Normalization for Enhanced Code-Switching ASR. Highlight: In this study, we propose AdaCS, a normalization model that integrates an adaptive bias attention module (BAM) into an encoder-decoder network. | The Chuong Chu; Vu Tuan Dat Pham; Kien Dao; Hoang Nguyen; Quoc Hung Truong; | arxiv-cs.CL | 2025-01-13 |
7 | Joint Automatic Speech Recognition And Structure Learning For Better Speech Understanding. Highlight: In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), an end-to-end SLU model based on span, which can accurately transcribe speech and extract structured content simultaneously. | JILIANG HU et al. | arxiv-cs.SD | 2025-01-13 |
8 | Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding. Highlight: However, the evaluation of multilingual SLU remains limited to shallower tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses topical speech classification in 102 languages and multiple-choice question answering through listening comprehension in 92 languages. | Fabian David Schmidt; Ivan Vulić; Goran Glavaš; David Ifeoluwa Adelani; | arxiv-cs.CL | 2025-01-10 |
9 | Comparing Self-Supervised Learning Models Pre-Trained on Human Speech and Animal Vocalizations for Bioacoustics Processing. Highlight: We conduct a comparative analysis using three diverse bioacoustic datasets and two different bioacoustic tasks. | Eklavya Sarkar; Mathew Magimai.-Doss; | arxiv-cs.LG | 2025-01-10 |
10 | Benchmarking Rotary Position Embeddings for Automatic Speech Recognition. Highlight: In this work, we conduct a comprehensive evaluation of RoPE across diverse automatic speech recognition (ASR) tasks. | Shucong Zhang; Titouan Parcollet; Rogier van Dalen; Sourav Bhattacharya; | arxiv-cs.CL | 2025-01-10 |
11 | Universal-2-TF: Robust All-Neural Text Formatting for ASR. Highlight: This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). | Yash Khare; Taufiquzzaman Peyash; Andrea Vanzo; Takuya Yoshioka; | arxiv-cs.CL | 2025-01-10 |
12 | Samba-ASR: State-Of-The-Art Speech Recognition Leveraging Structured State-Space Models. Highlight: We propose Samba ASR, the first state-of-the-art Automatic Speech Recognition (ASR) model leveraging the novel Mamba architecture as both encoder and decoder, built on the foundation of state space models (SSMs). | Syed Abdul Gaffar Shakhadri; Kruthika KR; Kartik Basavaraj Angadi; | arxiv-cs.CL | 2025-01-06 |
13 | Reducing The Gap Between Pretrained Speech Enhancement and Recognition Models Using A Real Speech-Trained Bridging Module. Highlight: In this paper, we propose training strategies to train the bridging module with real noisy speech. | ZHONGJIAN CUI et al. | arxiv-cs.SD | 2025-01-05 |
14 | Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison. Highlight: In order to perform a controlled architectural comparison, we train all models from scratch, rather than using large pretrained models, and use comparable data and parameter settings, testing speech-to-text recognition (ASR) and translation (ST) on MuST-C v1.0 and CoVoST2 datasets. | Tsz Kin Lam; Marco Gaido; Sara Papi; Luisa Bentivogli; Barry Haddow; | arxiv-cs.CL | 2025-01-04 |
15 | Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition. Highlight: In this work, we propose a novel GER paradigm for AVSR, termed AVGER, that follows the concept of “listening and seeing again”. | Rui Liu; Hongyu Yuan; Haizhou Li; | arxiv-cs.MM | 2025-01-03 |
16 | Whisper Turns Stronger: Augmenting Wav2Vec 2.0 for Superior ASR in Low-Resource Languages. Highlight: This paper introduces an end-to-end framework that enhances ASR systems fine-tuned on Wav2Vec2 through data augmentation techniques. | Or Haim Anidjar; Revital Marbel; Roi Yozevitch; | arxiv-cs.CL | 2024-12-31 |
17 | Zero-resource Speech Translation and Recognition with LLMs. Highlight: In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. | KAREL MUNDNICH et al. | arxiv-cs.CL | 2024-12-24 |
18 | Adapting Whisper for Code-Switching Through Encoding Refining and Language-Aware Decoding. Highlight: In this paper, we adapt Whisper, which is a large-scale multilingual pre-trained speech recognition model, to CS from both encoder and decoder parts. | JIAHUI ZHAO et al. | arxiv-cs.CL | 2024-12-21 |
19 | CAMEL: Cross-Attention Enhanced Mixture-of-Experts and Language Bias for Code-Switching Speech Recognition. Highlight: In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. | HE WANG et al. | arxiv-cs.SD | 2024-12-17 |
20 | Style-agnostic Evaluation of ASR Using Multiple Reference Transcripts. Highlight: Evaluation datasets suffer from varying style, formality, and inherent ambiguity of the transcription task. In this work, we attempt to mitigate some of these differences by performing style-agnostic evaluation of ASR systems using multiple references transcribed under opposing style parameters. | QUINTEN MCNAMARA et al. | arxiv-cs.CL | 2024-12-10 |
21 | ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction. Highlight: We apply LLMs to ASR error correction in three paradigms. | Victor Junqiu Wei; Weicheng Wang; Di Jiang; Yuanfeng Song; Lu Wang; | arxiv-cs.CL | 2024-12-04 |
22 | Speech Recognition-based Feature Extraction for Enhanced Automatic Severity Classification in Dysarthric Speech. Highlight: However, existing methods do not encompass all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss. | Yerin Choi; Jeehyun Lee; Myoung-Wan Koo; | arxiv-cs.SD | 2024-12-04 |
23 | GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot. Highlight: We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. | AOHAN ZENG et al. | arxiv-cs.CL | 2024-12-03 |
24 | A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario. Highlight: Hence, in this work, we aim to explore the capability of LLMs in low resource ASR and Mandarin-English code switching ASR. | Zheshu Song; Ziyang Ma; Yifan Yang; Jianheng Zhuo; Xie Chen; | arxiv-cs.AI | 2024-12-01 |
25 | Continual Learning in Machine Speech Chain Using Gradient Episodic Memory. Highlight: This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). | GEOFFREY TYNDALL et al. | arxiv-cs.CL | 2024-11-27 |
26 | MSA-ASR: Efficient Multilingual Speaker Attribution with Frozen ASR Models. Highlight: This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. | Thai-Binh Nguyen; Alexander Waibel; | arxiv-cs.CL | 2024-11-27 |
27 | How to Learn A New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario. Highlight: Typical solutions like fine-tuning the SSL model suffer from high computation costs, while using frozen SSL models as feature extractors comes with poor performance. | SHIH-HENG WANG et al. | arxiv-cs.SD | 2024-11-27 |
28 | Aligning Pre-trained Models for Spoken Language Translation. Highlight: This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). | Šimon Sedláček; Santosh Kesiraju; Alexander Polok; Jan Černocký; | arxiv-cs.CL | 2024-11-27 |
29 | AMPS: ASR with Multimodal Paraphrase Supervision. Highlight: In this work, we present a new technique AMPS that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. | Amruta Parulekar; Abhishek Gupta; Sameep Chattopadhyay; Preethi Jyothi; | arxiv-cs.CL | 2024-11-27 |
30 | Scaling Speech-Text Pre-training with Synthetic Interleaved Data. Highlight: We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. | AOHAN ZENG et al. | arxiv-cs.CL | 2024-11-26 |
31 | Comparative Analysis of ASR Methods for Speech Deepfake Detection. Highlight: However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. | DAVIDE SALVI et al. | arxiv-cs.SD | 2024-11-26 |
32 | Hard-Synth: Synthesizing Diverse Hard Samples for ASR Using Zero-Shot TTS and LLM. Highlight: In this paper, we propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS. | JIAWEI YU et al. | arxiv-cs.CL | 2024-11-20 |
33 | Tiny-Align: Bridging Automatic Speech Recognition and Large Language Model on The Edge. Highlight: In this work, we propose a resource-efficient cross-modal alignment framework that bridges ASR and LLMs on edge devices to handle personalized audio input. | RUIYANG QIN et al. | arxiv-cs.SD | 2024-11-20 |
34 | CAFE: A Novel Code Switching Dataset for Algerian Dialect, French and English. Highlight: The paper introduces and publicly releases (Data download link available after acceptance) CAFE — the first code-switching dataset between Algerian dialect, French, and English. | HOUSSAM EDDINE-OTHMAN LACHEMAT et al. | arxiv-cs.SD | 2024-11-20 |
35 | BanglaDialecto: An End-to-End AI-Powered Regional Speech Standardization. Highlight: This study presents an end-to-end pipeline for converting dialectal Noakhali speech to standard Bangla speech. | MD. NAZMUS SADAT SAMIN et al. | arxiv-cs.CL | 2024-11-16 |
36 | Interactive Cycle Model: The Linkage Combination Among Automatic Speech Recognition, Large Language Models and Smart Glasses. Highlight: This research proposes the interaction loop model ASR-LLMs-Smart Glasses, which combines automatic speech recognition, large language models and smart glasses to facilitate seamless human-computer interaction. | Libo Wang; | arxiv-cs.HC | 2024-11-15 |
37 | Everyone Deserves Their Voice to Be Heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data. Highlight: We analyzed the word error rate, character error rate and a BERT-based semantic similarity across gender groups. We used the moral framework of Weerts et al. (2022) to assess quality of service harms and fairness, and to provide a nuanced discussion on the implications of these biases, particularly for automatic subtitling. | Rik Raes; Saskia Lensink; Mykola Pechenizkiy; | arxiv-cs.CL | 2024-11-14 |
38 | Optimizing Entity Resolution in Voice Interfaces: An ASR-Aware Entity Reference Expansion Approach. Highlight: Navigating the equilibrium between accuracy and online retrieval’s speed requirement proves challenging, particularly when limited data links the failed mentions to resolved entities. In this paper, we propose an entity reference expansion system, injecting pairs of failed mentions and resolved entity names into the knowledge graph, enhancing its awareness of unresolved mentions. | Jiangning Chen; Ziyun Zhang; Qianli Hu; | emnlp | 2024-11-11 |
39 | Task Arithmetic Can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition. Highlight: However, existing methods suffer in performance when they fine-tune an automatic speech recognition (ASR) model on synthetic data as they suffer from the distributional shift commonly referred to as the synthetic-to-real gap. In this paper, we find that task arithmetic is effective at mitigating this gap. | Hsuan Su; Hua Farn; Fan-Yun Sun; Shang-Tse Chen; Hung-yi Lee; | emnlp | 2024-11-11 |
40 | BLSP-Emo: Towards Empathetic Large Speech-Language Models. Highlight: In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generating empathetic responses. | CHEN WANG et al. | emnlp | 2024-11-11 |
41 | Advancing Test-Time Adaptation in Wild Acoustic Test Settings. Highlight: In this work, we propose a novel wild acoustic TTA method tailored for ASR fine-tuned acoustic foundation models. | Hongfu Liu; Hengguan Huang; Ye Wang; | emnlp | 2024-11-11 |
42 | TokenVerse: Towards Unifying Speech and NLP Tasks Via Transducer-based ASR. Highlight: Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. | SHASHI KUMAR et al. | emnlp | 2024-11-11 |
43 | VHASR: A Multimodal Speech Recognition System With Vision Hotwords. Highlight: In this paper, we propose a novel approach effectively utilizing audio-related image information and set up VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model’s speech recognition capability. | JILIANG HU et al. | emnlp | 2024-11-11 |
44 | Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding. Highlight: In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. | YEONJOON JUNG et al. | emnlp | 2024-11-11 |
45 | What Is Lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations. Highlight: Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. | Kavya Manohar; Leena G Pillai; | emnlp | 2024-11-11 |
46 | Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps. Highlight: Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. | Giuseppe Attanasio; Beatrice Savoldi; Dennis Fucci; Dirk Hovy; | emnlp | 2024-11-11 |
47 | Mamba-based Decoder-Only Approach with Bidirectional Speech Modeling for Speech Recognition. Highlight: Selective state space models (SSMs) represented by Mamba have demonstrated their computational efficiency and promising outcomes in various tasks, including automatic speech recognition (ASR). | Yoshiki Masuyama; Koichi Miyazaki; Masato Murata; | arxiv-cs.SD | 2024-11-11 |
48 | Continual Test-time Adaptation for End-to-end Speech Recognition on Noisy Speech. Highlight: Following this framework, we introduce Dynamic SUTA (DSUTA), an entropy-minimization-based continual TTA method for ASR. | Guan-Ting Lin; Wei Ping Huang; Hung-yi Lee; | emnlp | 2024-11-11 |
49 | Multistage Fine-tuning Strategies for Automatic Speech Recognition in Low-resource Languages. Highlight: In this approach we aim to build ASR models for languages with limited digital resources by sequentially adapting the model across linguistically similar languages. | Leena G Pillai; Kavya Manohar; Basil K Raju; Elizabeth Sherly; | arxiv-cs.CL | 2024-11-07 |
50 | Dialectal Coverage And Generalization in Arabic Speech Recognition. Highlight: This study explores three critical factors influencing ASR performance: the role of dialectal coverage in pre-training, the effectiveness of dialect-specific fine-tuning compared to a multi-dialectal approach, and the ability to generalize to unseen dialects. | Amirbek Djanibekov; Hawau Olamide Toyin; Raghad Alshalan; Abdullah Alitr; Hanan Aldarmaki; | arxiv-cs.CL | 2024-11-07 |
51 | Enhancing AAC Software for Dysarthric Speakers in E-Health Settings: An Evaluation Using TORGO. Highlight: Prompt-overlap is a well-known issue with this dataset, where phrases overlap between training and test speakers. Our work proposes an algorithm to break this prompt-overlap. | Macarious Hui; Jinda Zhang; Aanchan Mohan; | arxiv-cs.CL | 2024-11-01 |
52 | Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs. Highlight: We propose a novel approach that first refines all available transcriptions to ensure data reliability. | Enshi Zhang; Christian Poellabauer; | arxiv-cs.CL | 2024-10-27 |
53 | STTATTS: Unified Speech-To-Text And Text-To-Speech Model. Highlight: We propose a parameter-efficient approach to learning ASR and TTS jointly via a multi-task learning objective and shared parameters. | Hawau Olamide Toyin; Hao Li; Hanan Aldarmaki; | arxiv-cs.CL | 2024-10-24 |
54 | Evaluating and Improving Automatic Speech Recognition Systems for Korean Meteorological Experts. Highlight: Our contributions include creating a domain-specific dataset, comprehensive ASR model evaluations, and an effective augmentation technique. | ChaeHun Park; Hojun Cho; Jaegul Choo; | arxiv-cs.CL | 2024-10-24 |
55 | MmWave-Whisper: Phone Call Eavesdropping and Transcription Using Millimeter-Wave Radar. Highlight: This paper introduces mmWave-Whisper, a system that demonstrates the feasibility of full-corpus automated speech recognition (ASR) on phone calls eavesdropped remotely using off-the-shelf frequency modulated continuous wave (FMCW) millimeter-wave radars. | Suryoday Basak; Abhijeeth Padarthi; Mahanth Gowda; | arxiv-cs.SD | 2024-10-22 |
56 | DENOASR: Debiasing ASRs Through Selective Denoising. Highlight: In this work, we introduce a novel framework DENOASR, a selective denoising technique to reduce the disparity in word error rates between the two gender groups, male and female. | Anand Kumar Rai; Siddharth D Jaiswal; Shubham Prakash; Bendi Pragnya Sree; Animesh Mukherjee; | arxiv-cs.SD | 2024-10-22 |
57 | VoiceBench: Benchmarking LLM-Based Voice Assistants. Highlight: VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field. | YIMING CHEN et al. | arxiv-cs.CL | 2024-10-22 |
58 | Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR. Highlight: Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. | Abhishek Gupta; Amruta Parulekar; Sameep Chattopadhyay; Preethi Jyothi; | arxiv-cs.CL | 2024-10-17 |
59 | Investigation of Speaker Representation for Target-Speaker Speech Processing. Highlight: While most studies have focused on training schemes or system architectures for each specific task, the auxiliary network for embedding target-speaker cues has not been investigated comprehensively in a unified cross-task evaluation. Therefore, this paper aims to address a fundamental question: what is the preferred speaker embedding for TS tasks? | TAKANORI ASHIHARA et al. | arxiv-cs.SD | 2024-10-14 |
60 | Automatic Speech Recognition with BERT and CTC Transformers: A Review. Highlight: All in all, this review provides valuable insights for researchers and practitioners who are interested in ASR with BERT and CTC transformers. | Noussaiba Djeffal; Hamza Kheddar; Djamel Addou; Ahmed Cherif Mazari; Yassine Himeur; | arxiv-cs.CL | 2024-10-12 |
61 | Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities. Highlight: To develop Indonesian automatic speech recognition (ASR), we present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper, as well as compiling a dataset comprising Indonesian speech with variabilities to facilitate our study. | AULIA ADILA et al. | arxiv-cs.CL | 2024-10-11 |
62 | Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments. Highlight: In this work, we hypothesize that incorporating speaker representations during speech recognition can enhance model robustness to noise. | Sagarika Alavilli; Annesya Banerjee; Gasser Elbanna; Annika Magaro; | arxiv-cs.SD | 2024-10-07 |
63 | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation. Highlight: This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. | HEESEUNG KIM et al. | nips | 2024-10-07 |
64 | REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. Highlight: In this paper, we propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR. | LIANG-HSUAN TSENG et al. | nips | 2024-10-07 |
65 | Comprehensive Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for The Polish Language. Highlight: A comprehensive framework has been designed to survey, catalog, and curate available speech datasets, which allows replicable evaluation of automatic speech recognition (ASR) systems. | Michał Junczyk; | nips | 2024-10-07 |
66 | Context and System Fusion in Post-ASR Emotion Recognition with Large Language Models. Highlight: Large language models (LLMs) have started to play a vital role in modelling speech and text. | Pavel Stepachev; Pinzhen Chen; Barry Haddow; | arxiv-cs.CL | 2024-10-04 |
67 | Reverb: Open-Source ASR and Diarization from Rev. Abstract: Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers as well as … | NISHCHAL BHANDARI et al. | arxiv-cs.CL | 2024-10-04 |
68 | Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition. Highlight: The following paper presents an alternative approach towards generating compressed spectrogram representations, based on Convolutional Variational Autoencoders (VAE). | Olga Iakovenko; Ivan Bondarenko; | arxiv-cs.SD | 2024-10-03 |
69 | Algorithms For Automatic Accentuation And Transcription Of Russian Texts In Speech Recognition Systems. Highlight: The rules described in the present paper are implemented in an open-source module, which can be of use to any scientific study connected to ASR or Speech To Text (STT) tasks. | Olga Iakovenko; Ivan Bondarenko; Mariya Borovikova; Daniil Vodolazsky; | arxiv-cs.CL | 2024-10-03 |
70 | Automatic Speech Recognition for The Ika Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a cost-effective approach for developing Automatic Speech Recognition (ASR) models for low-resource languages like Ika. |
Uchenna Nzenwata; Daniel Ogbuigwe; | arxiv-cs.CL | 2024-10-01 |
71 | Recent Advances in Speech Language Models: A Survey Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. |
WENQIAN CUI et. al. | arxiv-cs.CL | 2024-10-01 |
72 | AfriHuBERT: A Self-supervised Speech Representation Model for African Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present AfriHuBERT, an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model, originally pretrained on 147 languages. |
Jesujoba O. Alabi; Xuechen Liu; Dietrich Klakow; Junichi Yamagishi; | arxiv-cs.CL | 2024-09-30 |
73 | ChildMandarin: A Comprehensive Mandarin Speech Dataset for Young Children Aged 3-5 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, developing robust ASR models for young children’s speech remains challenging due to differences in pronunciation, tone, and pace compared to adult speech. In this paper, we introduce a new Mandarin speech dataset focused on children aged 3 to 5, addressing the scarcity of resources in this area. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2024-09-27 |
74 | Improving Multilingual ASR in The Wild Using Simple N-best Re-ranking Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy for several prominent acoustic models by employing external features such as language models and text-based language identification models. |
Brian Yan; Vineel Pratap; Shinji Watanabe; Michael Auli; | arxiv-cs.CL | 2024-09-26 |
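The re-ranking approach above (combining an acoustic-model score with external language-model and language-identification scores) is commonly realized as a log-linear interpolation over the N-best list. Below is a minimal sketch under that assumption; the function name, the score callables, and the weights `alpha`/`beta` are illustrative, not the paper's implementation:

```python
def rerank_nbest(hypotheses, lm_score, lid_score, alpha=0.5, beta=0.3):
    """Re-rank ASR N-best hypotheses with external feature scores.

    hypotheses: list of (text, asr_log_prob) pairs from the acoustic model.
    lm_score, lid_score: callables returning a log-score for a text.
    alpha, beta: interpolation weights, tuned on held-out data in practice.
    """
    def combined(item):
        text, asr_log_prob = item
        return asr_log_prob + alpha * lm_score(text) + beta * lid_score(text)
    # Highest combined score first; index 0 becomes the new 1-best.
    return sorted(hypotheses, key=combined, reverse=True)
```

In practice the weights are chosen by sweeping over a development set, since the best trade-off between acoustic and external evidence varies by language and domain.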
75 | Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: These models often rely on an ASR-to-TTS chain-of-thought pipeline, converting speech into text for processing before generating audio responses, which introduces latency and loses audio features. We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities. |
Robin Shing-Hei Yuen; Timothy Tin-Long Tse; Jian Zhu; | arxiv-cs.CL | 2024-09-25 |
76 | Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a novel application of weighted cross-entropy, typically used for unbalanced datasets, to facilitate the integration of low-resource languages into pre-trained multilingual ASR models within the context of continual multilingual learning. |
Andrés Piñeiro-Martín; Carmen García-Mateo; Laura Docío-Fernández; María del Carmen López-Pérez; Georg Rehm; | arxiv-cs.CL | 2024-09-25 |
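Weighted cross-entropy, as applied above to unbalanced multilingual data, scales each example's loss by a per-class weight so that under-represented (low-resource) languages contribute more strongly to the gradient. A minimal pure-Python sketch of the idea; all names are hypothetical and this is not the paper's code:

```python
import math

def weighted_cross_entropy(probs, target, weight):
    """Cross-entropy for one example, scaled by a per-language weight.

    probs: predicted probability distribution over classes (sums to 1).
    target: index of the gold class/token.
    weight: larger values up-weight low-resource languages.
    """
    return -weight * math.log(probs[target])

def batch_loss(batch):
    """Weight-normalized average loss over (probs, target, weight) triples."""
    total_weight = sum(w for _, _, w in batch)
    return sum(weighted_cross_entropy(p, t, w) for p, t, w in batch) / total_weight
```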
77 | Spelling Correction Through Rewriting of Non-Autoregressive ASR Lattices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models. |
LEONID VELIKOVICH et. al. | arxiv-cs.CL | 2024-09-24 |
78 | Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). |
FENGRUN ZHANG et. al. | arxiv-cs.SD | 2024-09-24 |
79 | Bridging Speech and Text: Enhancing ASR with Pinyin-to-Character Pre-training in LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel training approach to enhance LLM performance in ASR tasks. |
Yang Yuhang; Peng Yizhou; Eng Siong Chng; Xionghu Zhong; | arxiv-cs.CL | 2024-09-24 |
80 | MultiMed: Multilingual Medical Speech Recognition Via Attention Encoder Decoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. |
KHAI LE-DUC et. al. | arxiv-cs.CL | 2024-09-21 |
81 | Fast Streaming Transducer ASR Prototyping Via Knowledge Distillation with Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch in consumer and accessible GPUs in their entirety with pseudo-labeled (PL) speech from foundational speech models (FSM). |
IULIIA THORBECKE et. al. | arxiv-cs.CL | 2024-09-20 |
82 | A Multimodal Dense Retrieval Approach for Speech-Based Open-Domain Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, the ASR model propagates its errors to the retriever. In this work, we try to alleviate these limitations by proposing an ASR-free, end-to-end trained multimodal dense retriever that can work directly on spoken questions. |
Georgios Sidiropoulos; Evangelos Kanoulas; | arxiv-cs.CL | 2024-09-20 |
83 | Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the present work, we conduct a set of experiments around zero-shot learning with synthetic speech data for the specific task of speech commands classification. |
Sebastião Quintas; Isabelle Ferrané; Thomas Pellegrini; | arxiv-cs.SD | 2024-09-19 |
84 | Personalized Speech Recognition for Children with Test-Time Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We devised a novel ASR pipeline to apply unsupervised test-time adaptation (TTA) methods for child speech recognition, so that ASR models pre-trained on adult speech can be continuously adapted to each child speaker at test time without further human annotations. |
Zhonghao Shi; Harshvardhan Srivastava; Xuan Shi; Shrikanth Narayanan; Maja J. Matarić; | arxiv-cs.LG | 2024-09-19 |
85 | ASR Benchmarking: Need for A More Representative Conversational Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. |

Gaurav Maheshwari; Dmitry Ivanov; Théo Johannet; Kevin El Haddad; | arxiv-cs.CL | 2024-09-18 |
86 | Large Language Models Are Strong Audio-Visual Speech Recognition Learners Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On the contrary, tasks like visual and audio-visual speech recognition (VSR/AVSR), which also exploit noise-invariant lip movement information, have received little or no attention. To bridge this gap, we propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. |
UMBERTO CAPPELLAZZO et. al. | arxiv-cs.CV | 2024-09-18 |
87 | M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose M2R-whisper, a novel multi-stage and multi-scale retrieval augmentation approach designed to enhance ASR performance in low-resource settings. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2024-09-18 |
88 | Channel-Aware Domain-Adaptive Generative Adversarial Network for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While pre-trained automatic speech recognition (ASR) systems demonstrate impressive performance on matched domains, their performance often degrades when confronted with channel mismatch stemming from unseen recording environments and conditions. To mitigate this issue, we propose a novel channel-aware data simulation method for robust ASR training. |
CHIEN-CHUN WANG et. al. | arxiv-cs.SD | 2024-09-18 |
89 | Simulating Native Speaker Shadowing for Nonnative Speech Assessment with Latent Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a speech generation system that simulates the L1 shadowing process using voice conversion (VC) techniques and latent speech representations. |
Haopeng Geng; Daisuke Saito; Nobuaki Minematsu; | arxiv-cs.SD | 2024-09-18 |
90 | WER We Stand: Benchmarking Urdu ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. |
SAMEE ARIF et. al. | arxiv-cs.CL | 2024-09-17 |
91 | Chain-of-Thought Prompting for Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. |
KE HU et. al. | arxiv-cs.CL | 2024-09-17 |
92 | Speech Recognition for Analysis of Police Radio Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription are challenging in this domain. |
Tejes Srivastava; Ju-Chieh Chou; Priyank Shroff; Karen Livescu; Christopher Graziul; | arxiv-cs.SD | 2024-09-16 |
93 | Augmenting Automatic Speech Recognition Models with Disfluency Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. |
Robin Amann; Zhaolin Li; Barbara Bruno; Jan Niehues; | arxiv-cs.CL | 2024-09-16 |
94 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. |
CHAO-HAN HUCK YANG et. al. | arxiv-cs.CL | 2024-09-15 |
95 | Exploring SSL Discrete Tokens for Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. |
MINGYU CUI et. al. | arxiv-cs.CL | 2024-09-13 |
96 | LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents. To address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR. |
SHAOJUN LI et. al. | arxiv-cs.SD | 2024-09-13 |
97 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. |
LINGWEI MENG et. al. | arxiv-cs.CL | 2024-09-13 |
98 | M$^{3}$V: A Multi-modal Multi-view Approach for Device-Directed Speech Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task that introduces unimodal views and a text-audio alignment view in the network besides the multi-modal one. |
ANNA WANG et. al. | arxiv-cs.SD | 2024-09-13 |
99 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-09-12 |
100 | WhisperNER: Unified Open Named Entity and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. |
GIL AYACHE et. al. | arxiv-cs.CL | 2024-09-12 |
101 | The Faetar Benchmark: Speech Recognition in A Very Under-Resourced Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. |
MICHAEL ONG et. al. | arxiv-cs.CL | 2024-09-12 |
102 | Enhancing CTC-Based Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). |
Hendrik Laux; Anke Schmeink; | arxiv-cs.CV | 2024-09-11 |
103 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. |
Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Batthacharya; | arxiv-cs.SD | 2024-09-11 |
104 | Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of the DST model. |
Jihyun Lee; Solee Im; Wonjun Lee; Gary Geunbae Lee; | arxiv-cs.CL | 2024-09-10 |
105 | What Is Lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our research reveals that current text normalization practices, which aim to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially improved performance metrics for Indic languages. |
Kavya Manohar; Leena G Pillai; Elizabeth Sherly; | arxiv-cs.CL | 2024-09-04 |
106 | Quantification of Stylistic Differences in Human- and ASR-produced Transcripts of African American English Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate the interactions of these categories with how well transcripts can be compared via word error rate (WER). |
ANNIKA HEUSER et. al. | arxiv-cs.CL | 2024-09-04 |
107 | Comparing Discrete and Continuous Space LLMs for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. |
Yaoxun Xu; Shi-Xiong Zhang; Jianwei Yu; Zhiyong Wu; Dong Yu; | arxiv-cs.CL | 2024-09-01 |
108 | Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. |
Hao Shi; Yuan Gao; Zhaoheng Ni; Tatsuya Kawahara; | arxiv-cs.SD | 2024-09-01 |
109 | LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. |
ZENGRUI JIN et. al. | arxiv-cs.SD | 2024-09-01 |
110 | ProGRes: Prompted Generative Rescoring on ASR N-Best Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. |
Ada Defne Tur; Adel Moumen; Mirco Ravanelli; | arxiv-cs.CL | 2024-08-30 |
111 | Measuring The Accuracy of Automatic Speech Recognition Solutions IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: At the same time the DHH community reports serious issues with the accuracy and reliability of ASR. |
Korbinian Kuhn; Verena Kersken; Benedikt Reuter; Niklas Egger; Gottfried Zimmermann; | arxiv-cs.CL | 2024-08-29 |
112 | Speech Recognition Transformers: Topological-lingualism Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The paper presents a comprehensive survey of transformer techniques oriented toward the speech modality. |
Shruti Singh; Muskaan Singh; Virender Kadyan; | arxiv-cs.CL | 2024-08-27 |
113 | Self-supervised Speech Representations Still Struggle with African American Vernacular English Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. |
KALVIN CHANG et. al. | arxiv-cs.CL | 2024-08-26 |
114 | MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). |
KULUHAN BINICI et. al. | arxiv-cs.CL | 2024-08-26 |
115 | Developing Vocal System Impaired Patient-aimed Voice Quality Assessment Approach Using ASR Representation-included Multiple Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This article addresses these challenges by showcasing the utilization of automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech. This innovative approach aims to estimate voice quality of patients with impaired vocal systems. |
SHAOXIANG DANG et. al. | arxiv-cs.SD | 2024-08-22 |
116 | Towards Measuring Fairness in Speech Recognition: Fair-Speech Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information, such as age, gender, ethnicity, geographic variation and whether the participants consider themselves native English speakers. |
IRINA-ELENA VELICHE et. al. | arxiv-cs.AI | 2024-08-22 |
117 | The State of Commercial Automatic French Legal Speech Recognition Systems and Their Impact on Court Reporters Et Al Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We benchmark three ASR models, including commercial and open-source options, on their ability to recognize French legal speech using a curated dataset. Our study evaluates the performance of these systems using the Word Error Rate (WER) metric and introduces the Sonnex Distance to account for phonetic accuracy. |
Nicolad Garneau; Olivier Bolduc; | arxiv-cs.CL | 2024-08-21 |
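The Word Error Rate (WER) used in the benchmark above is the standard metric: the word-level Levenshtein edit distance (substitutions, insertions, deletions) between hypothesis and reference, normalized by the reference length. A minimal reference sketch, not tied to any particular toolkit:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why reported scores on noisy domains sometimes look surprising.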
118 | Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. |
Prashant Serai; Peidong Wang; Eric Fosler-Lussier; | arxiv-cs.AI | 2024-08-20 |
119 | Error-preserving Automatic Speech Recognition of Young English Learners' Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the mistakes made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their mistakes. |
JANICK MICHOT et. al. | acl | 2024-08-20 |
120 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | acl | 2024-08-20 |
121 | StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. |
SHAOLEI ZHANG et. al. | acl | 2024-08-20 |
122 | CopyNE: Better Contextual ASR By Copying Named Entities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we treat entities as indivisible wholes and introduce the idea of copying into ASR. |
SHILIN ZHOU et. al. | acl | 2024-08-20 |
123 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | acl | 2024-08-20 |
124 | XCB: An Effective Contextual Biasing Approach to Bias Cross-lingual Phrases in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models often struggle with bilingual settings, which are prevalent in code-switching speech recognition. In this study, we make the initial attempt to address this challenge by introducing a Cross-lingual Contextual Biasing (XCB) module. |
Xucheng Wan; Naijun Zheng; Kai Liu; Huan Zhou; | arxiv-cs.CL | 2024-08-20 |
125 | A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR expert as a transcription tokenizer and a hybrid Autoregressive (AR) Non-autoregressive (NAR) decoding approach to solve the above problems. |
YANGZE LI et. al. | arxiv-cs.SD | 2024-08-18 |
126 | Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and … |
Yinghao Aaron Li; Xilin Jiang; Jordan Darefsky; Ge Zhu; N. Mesgarani; | ArXiv | 2024-08-13 |
127 | Enhancing Dialogue Speech Recognition with Robust Contextual Awareness Via Noise Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. |
Wonjun Lee; San Kim; Gary Geunbae Lee; | arxiv-cs.CL | 2024-08-12 |
128 | Audio Enhancement for Computer Audition — An Iterative Training Paradigm Using Sample Importance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. |
Manuel Milling; Shuo Liu; Andreas Triantafyllopoulos; Ilhan Aslan; Björn W. Schuller; | arxiv-cs.SD | 2024-08-12 |
129 | LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. |
Eunseop Yoon; Hee Suk Yoon; John Harvill; Mark Hasegawa-Johnson; Chang D. Yoo; | arxiv-cs.CL | 2024-08-11 |
130 | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present MooER, a LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model of Moore Threads. |
JUNHAO XU et. al. | arxiv-cs.CL | 2024-08-09 |
131 | Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present accent clustering and mining schemes for fair speech recognition systems which can perform equally well on under-represented accented speech. |
JAEYOUNG KIM et. al. | arxiv-cs.SD | 2024-08-05 |
132 | Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we introduce a novel framework that diverges from typical second-pass rescoring methods. |
Yixuan Tang; Anthony K. H. Tung; | ijcai | 2024-08-03 |
133 | ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms Using Linguistic Features Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. |
PENG CHENG et. al. | arxiv-cs.CR | 2024-08-03 |
134 | MECOS: A Bilingual Manipuri-English Spontaneous Code-switching Speech Corpus for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View |
Naorem Karline Singh; Y. J. Chanu; Hoomexsun Pangsatabam; | Comput. Speech Lang. | 2024-08-01 |
135 | On The Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use the comparison of five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training. |
Nick Rossenbach; Ralf Schlüter; Sakriani Sakti; | arxiv-cs.CL | 2024-07-31 |
136 | Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. |
KOHEI MATSUURA et. al. | arxiv-cs.CL | 2024-07-31 |
137 | Improving Noisy Student Training for Low-resource Languages in End-to-End ASR Using CycleGAN and Inter-domain Losses Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance. However, this approach requires a substantial … |
C. Li; Ngoc Thang Vu; | ArXiv | 2024-07-26 |
138 | Dynamic Language Group-Based MoE: Enhancing Code-Switching Speech Recognition with Hierarchical Routing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work proposes DLG-MoE, a Dynamic Language Group-based MoE, which can effectively handle the CS-ASR task and leverage the advantages of parameter scaling. |
HUKAI HUANG et. al. | arxiv-cs.CL | 2024-07-26 |
139 | Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terminologies. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. |
Jiwon Suh; Injae Na; Woohwan Jung; | arxiv-cs.CL | 2024-07-25 |
140 | On The Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). |
Benedikt Hilmes; Nick Rossenbach; Ralf Schlüter; | arxiv-cs.CL | 2024-07-25 |
141 | Coupling Speech Encoders with Downstream Text Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task. |
Ciprian Chelba; Johan Schalkwyk; | arxiv-cs.CL | 2024-07-24 |
142 | Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Building upon the strength of modern large language models (LLMs), generative error correction (GEC) has emerged as a promising paradigm that can elevate the performance of modern … |
Rithik Sachdev; Zhong-Qiu Wang; Chao-Han Huck Yang; | arxiv-cs.CL | 2024-07-23 |
143 | Quantifying The Role of Textual Predictability in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use this method to demonstrate that a Wav2Vec 2.0-based model makes stronger use of textual context than a hybrid ASR model, in spite of not using an explicit language model, and also use it to shed light on recent results demonstrating poor performance of standard ASR systems on African-American English. We demonstrate that these mostly represent failures of acoustic–phonetic modelling. |
Sean Robertson; Gerald Penn; Ewan Dunbar; | arxiv-cs.CL | 2024-07-23 |
144 | dMel: Speech Tokenization Made Simple Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Using an LM-style transformer architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR) and speech synthesis (TTS). |
HE BAI et. al. | arxiv-cs.CL | 2024-07-22 |
145 | SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. |
Hazim Bukhari; Soham Deshmukh; Hira Dhamyal; Bhiksha Raj; Rita Singh; | arxiv-cs.SD | 2024-07-21 |
146 | Low-Resourced Speech Recognition for Iu Mien Language Via Weakly-Supervised Phoneme-based Multilingual Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With less than 10 hours of transcribed Iu Mien language, this paper investigates and compares the three approaches for Iu Mien speech recognition. |
LUKUAN DONG et. al. | arxiv-cs.SD | 2024-07-18 |
147 | Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding By Provenance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Automatic speech recognition (ASR) models trained on large amounts of audio data are now widely used to convert speech to written text in a variety of applications from video captioning to automated assistants used in healthcare and other domains. |
Changye Li; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-07-18 |
148 | Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present novel approaches that use a generative pretrained transformer (GPT) to identify paraphasias from transcripts as well as two end-to-end approaches that focus on modeling both automatic speech recognition (ASR) and paraphasia classification as multiple sequences vs. a single sequence. |
Matthew Perez; Aneesha Sampath; Minxue Niu; Emily Mower Provost; | arxiv-cs.CL | 2024-07-15 |
149 | Textless Dependency Parsing By Labeled Sequence Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although their effectiveness is shown in capturing acoustic features, it is unclear in capturing lexical knowledge. This paper proposes a textless method for dependency parsing, examining its effectiveness and limitations. |
Shunsuke Kando; Yusuke Miyao; Jason Naradowsky; Shinnosuke Takamichi; | arxiv-cs.CL | 2024-07-14 |
150 | CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer Based Streaming ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present CUSIDE-T, which successfully adapts the CUSIDE method over the recurrent neural network transducer (RNN-T) ASR architecture, instead of being based on the CTC architecture. |
Wenbo Zhao; Ziwei Li; Chuan Yu; Zhijian Ou; | arxiv-cs.SD | 2024-07-14 |
151 | Empowering Whisper As A Joint Multi-Talker and Target-Talker Speech Recognition System Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. |
LINGWEI MENG et. al. | arxiv-cs.SD | 2024-07-13 |
152 | HebDB: A Weakly Supervised Dataset for Hebrew Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. |
ARNON TURETZKY et. al. | arxiv-cs.CL | 2024-07-10 |
153 | LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner’s Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. |
HAECHAN KIM et. al. | arxiv-cs.CL | 2024-07-05 |
154 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. |
Vyas Raina; Mark Gales; | arxiv-cs.SD | 2024-07-05 |
155 | TokenVerse: Towards Unifying Speech and NLP Tasks Via Transducer-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. |
SHASHI KUMAR et. al. | arxiv-cs.CL | 2024-07-05 |
156 | Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study yields numerous significant findings that we are discussing in this paper. |
Salima Mdhaffar; Haroun Elleuch; Fethi Bougares; Yannick Estève; | arxiv-cs.CL | 2024-07-05 |
157 | Romanization Encoding For Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. |
WEN DING et. al. | arxiv-cs.CL | 2024-07-05 |
158 | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). |
KEYU AN et. al. | arxiv-cs.SD | 2024-07-04 |
159 | Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. |
Cong-Thanh Do; Shuhei Imai; Rama Doddipatla; Thomas Hain; | arxiv-cs.CL | 2024-07-04 |
160 | Improving Self-supervised Pre-training Using Accent-Specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. |
Darshan Prabhu; Abhishek Gupta; Omkar Nitsure; Preethi Jyothi; Sriram Ganapathy; | arxiv-cs.CL | 2024-07-04 |
161 | Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. |
Tiia Sildam; Andra Velve; Tanel Alumäe; | arxiv-cs.CL | 2024-07-04 |
162 | Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. |
Jinming Chen; Jingyi Fang; Yuanzhong Zheng; Yaoxuan Wang; Haojun Fei; | arxiv-cs.SD | 2024-07-03 |
163 | Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-07-01 |
164 | PaSCoNT – Parallel Speech Corpus of Northern-central Thai for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View |
SUPAWAT TAERUNGRUANG et. al. | Comput. Speech Lang. | 2024-07-01 |
165 | Less Is More: Accurate Speech Recognition & Translation Without Web-Scale Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that state-of-the art accuracy can be reached without relying on web-scale data. |
KRISHNA C. PUVVADA et. al. | arxiv-cs.CL | 2024-06-28 |
166 | Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. |
ZHENG FANG et. al. | arxiv-cs.CR | 2024-06-27 |
167 | Enhanced ASR Robustness to Packet Loss with A Front-End Adaptation Network Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose using a front-end adaptation network connected to a frozen ASR model. |
Yehoshua Dissen; Shiry Yonash; Israel Cohen; Joseph Keshet; | arxiv-cs.SD | 2024-06-27 |
168 | ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as Llama and Gemma. |
Ahmed Heakl; Youssef Zaghloul; Mennatullah Ali; Rania Hossam; Walid Gomaa; | arxiv-cs.CL | 2024-06-26 |
169 | Automatic Speech Recognition for Hindi Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations. |
Anish Saha; A. G. Ramakrishnan; | arxiv-cs.CL | 2024-06-26 |
170 | Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. |
Peikun Chen; Sining Sun; Changhao Shan; Qing Yang; Lei Xie; | arxiv-cs.SD | 2024-06-26 |
171 | Dynamic Data Pruning for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of the data. |
QIAO XIAO et. al. | arxiv-cs.CL | 2024-06-26 |
172 | MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. |
ADRIANA FERNANDEZ-LOPEZ et. al. | arxiv-cs.CV | 2024-06-25 |
173 | A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. |
VAN TUNG PHAM et. al. | arxiv-cs.LG | 2024-06-25 |
174 | SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose SC-MoE, a Switch-Conformer-based MoE system for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR). We design a streaming MoE layer consisting of three language experts, corresponding to Mandarin, English, and blank, respectively, and equip the SC-MoE encoder with a language identification (LID) network trained with a Connectionist Temporal Classification (CTC) loss as a router, achieving a real-time streaming CS ASR system. |
Shuaishuai Ye; Shunfei Chen; Xinhui Hu; Xinkang Xu; | arxiv-cs.SD | 2024-06-25 |
175 | FASA: A Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: When generating datasets, human annotations are not scalable, and existing forced-alignment tools are not usable as they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children’s speech data from the many existing noisy children’s speech datasets. |
Dancheng Liu; Jinjun Xiong; | arxiv-cs.CL | 2024-06-25 |
176 | Sequential Editing for Lifelong Training of Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. |
Devang Kulshreshtha; Saket Dingliwal; Brady Houston; Nikolaos Pappas; Srikanth Ronanki; | arxiv-cs.CL | 2024-06-25 |
177 | Exploring The Capability of Mamba in Speech Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. |
Koichi Miyazaki; Yoshiki Masuyama; Masato Murata; | arxiv-cs.SD | 2024-06-24 |
178 | Blending LLMs Into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present KIT’s offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. |
SAI KONERU et. al. | arxiv-cs.CL | 2024-06-24 |
179 | Perception of Phonological Assimilation By Neural Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). |
Charlotte Pouw; Marianne de Heer Kloots; Afra Alishahi; Willem Zuidema; | arxiv-cs.CL | 2024-06-21 |
180 | PI-Whisper: An Adaptive and Incremental ASR Framework for Diverse and Evolving Speaker Characteristics Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: As edge-based automatic speech recognition (ASR) technologies become increasingly prevalent for the development of intelligent and personalized assistants, three important … |
Amir Nassereldine; Dancheng Liu; Chenhui Xu; Jinjun Xiong; | ArXiv | 2024-06-21 |
181 | PI-Whisper: Designing An Adaptive and Incremental Automatic Speech Recognition System for Edge Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we show how the design of PI-Whisper allows for incremental adaptation of new characteristics without the need for repetitive retraining, enhances recognition capabilities, and improves equity and fairness across diverse speaker groups. |
AMIR NASSERELDINE et. al. | arxiv-cs.CL | 2024-06-21 |
182 | Massive End-to-end Speech Recognition Models with Time Reduction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate massive end-to-end automatic speech recognition (ASR) models with efficiency improvements achieved by time reduction. |
WEIRAN WANG et. al. | naacl | 2024-06-20 |
183 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | naacl | 2024-06-20 |
184 | Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. |
Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Neeraj Gaur; Zhong Meng; | arxiv-cs.AI | 2024-06-20 |
185 | Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. |
Suyoung Kim; Jiyeon Hwang; Ho-Young Jung; | naacl | 2024-06-20 |
186 | ManWav: The First Manchu ASR Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. |
Jean Seo; Minha Kang; Sungjoo Byun; Sangah Lee; | arxiv-cs.CL | 2024-06-19 |
187 | Joint Vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. |
Alexander Blatt; Aravind Krishnan; Dietrich Klakow; | arxiv-cs.CL | 2024-06-19 |
188 | Children’s Speech Recognition Through Discrete Token Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we investigate the integration of discrete speech tokens into children’s speech recognition systems as input without significantly degrading the ASR performance. |
Vrunda N. Sukhadia; Shammur Absar Chowdhury; | arxiv-cs.CL | 2024-06-19 |
189 | Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. |
Hayato Futami; Siddhant Arora; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe; | arxiv-cs.CL | 2024-06-18 |
190 | Bridging The Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. |
KUAN-CHEN WANG et. al. | arxiv-cs.SD | 2024-06-18 |
191 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | arxiv-cs.CL | 2024-06-18 |
192 | CoSTA: Code-Switched Speech Translation Using Aligned Speech-Text Interleaving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. |
Bhavani Shankar; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2024-06-16 |
193 | Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve the stealthiness of data poisoning, we propose a non-neural and fast algorithm called Random Spectrogram Rhythm Transformation (RSRT) in this paper. |
Wenhan Yao; Jiangkun Yang; Yongqiang He; Jia Liu; Weiping Wen; | arxiv-cs.SD | 2024-06-16 |
194 | Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper’s cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. |
Haoyu Wang; Guoqiang Hu; Guodong Lin; Wei-Qiang Zhang; Jian Li; | arxiv-cs.SD | 2024-06-14 |
195 | An Efficient Text Augmentation Approach for Contextualized Mandarin Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. |
Naijun Zheng; Xucheng Wan; Kai Liu; Ziqing Du; Zhou Huan; | arxiv-cs.SD | 2024-06-14 |
196 | Speech ReaLLM – Real-time Streaming Speech Recognition with Multimodal LLMs By Teaching The Flow of Time Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We introduce Speech ReaLLM, a new ASR architecture that marries decoder-only ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the … |
FRANK SEIDE et. al. | ArXiv | 2024-06-13 |
197 | Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a transcription-free method for joint training using only audio signals. |
WILLIAM RAVENSCROFT et. al. | arxiv-cs.SD | 2024-06-13 |
198 | LASER: Learning By Aligning Self-supervised Representations of Speech for Improving Content-related Tasks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named LASER: Learning by Aligning Self-supervised Representations is presented. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-06-13 |
199 | Speech ReaLLM — Real-time Streaming Speech Recognition with Multimodal LLMs By Teaching The Flow of Time Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Speech ReaLLM, a new ASR architecture that marries decoder-only ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. |
FRANK SEIDE et. al. | arxiv-cs.CL | 2024-06-13 |
200 | EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. |
ZIYANG ZHUANG et. al. | arxiv-cs.SD | 2024-06-13 |
201 | The Second DISPLACE Challenge : DIarization of SPeaker and LAnguage in Conversational Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The DIarization of SPeaker and LAnguage in Conversational Environments (DISPLACE) 2024 challenge is the second in the series of DISPLACE challenges, which involves tasks of … |
SHAREEF BABU KALLURI et. al. | ArXiv | 2024-06-13 |
202 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | arxiv-cs.CL | 2024-06-13 |
203 | ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. |
JIATONG SHI et. al. | arxiv-cs.SD | 2024-06-12 |
204 | Improving Child Speech Recognition with Augmented Child-like Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied … |
Yuanyuan Zhang; Zhengjun Yue; T. Patel; O. Scharenborg; | ArXiv | 2024-06-12 |
205 | Training Data Augmentation for Dysarthric Automatic Speech Recognition By Text-to-Dysarthric-Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. |
Wing-Zin Leung; Mattias Cross; Anton Ragni; Stefan Goetze; | arxiv-cs.SD | 2024-06-12 |
206 | Towards Unsupervised Speech Recognition Without Pronunciation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. |
JUNRUI NI et. al. | arxiv-cs.CL | 2024-06-12 |
207 | The Interspeech 2024 Challenge on Speech Processing Using Discrete Units Highlight: This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in this evolving field. |
XUANKAI CHANG et. al. | arxiv-cs.SD | 2024-06-11 |
208 | PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding Highlight: In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. |
TRANG LE et. al. | arxiv-cs.CL | 2024-06-11 |
209 | AS-70: A Mandarin Stuttered Speech Dataset for Automatic Speech Recognition and Stuttering Event Detection Highlight: However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the largest dataset in its category. |
RONG GONG et. al. | arxiv-cs.SD | 2024-06-11 |
210 | Reading Miscue Detection in Primary School Through Automatic Speech Recognition Highlight: We found that HuBERT Large fine-tuned on Dutch speech achieves SOTA phoneme-level child speech recognition (PER at 23.1%), while Whisper (Faster Whisper Large-v2) achieves SOTA word-level performance (WER at 9.8%). |
Lingyun Gao; Cristian Tejedor-Garcia; Helmer Strik; Catia Cucchiarini; | arxiv-cs.CL | 2024-06-11 |
211 | MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling Methods for Learning Speech Representations Highlight: In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model’s capacity. |
Hemant Yadav; Sunayana Sitaram; Rajiv Ratn Shah; | arxiv-cs.CL | 2024-06-09 |
212 | LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR Abstract: Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of … |
ZHESHU SONG et. al. | ArXiv | 2024-06-07 |
213 | Improving Zero-Shot Chinese-English Code-Switching ASR with KNN-CTC and Gated Monolingual Datastores Highlight: Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. |
JIAMING ZHOU et. al. | arxiv-cs.CL | 2024-06-06 |
214 | Hypernetworks for Personalizing ASR to Atypical Speech Highlight: Parameter-efficient fine-tuning (PEFT) for personalizing automatic speech recognition (ASR) has recently shown promise for adapting general population models to atypical speech. |
Max Müller-Eberstein; Dianna Yee; Karren Yang; Gautam Varma Mantena; Colin Lea; | arxiv-cs.LG | 2024-06-06 |
215 | Error-preserving Automatic Speech Recognition of Young English Learners’ Language Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. |
JANICK MICHOT et. al. | arxiv-cs.CL | 2024-06-05 |
216 | Text Injection for Neural Contextual Biasing Highlight: This work proposes contextual text injection (CTI) to enhance contextual ASR. |
ZHONG MENG et. al. | arxiv-cs.CL | 2024-06-05 |
217 | Discrete Multimodal Transformers with A Pretrained Large Language Model for Mixed-Supervision Speech Processing Highlight: In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). |
VIET ANH TRINH et. al. | arxiv-cs.CL | 2024-06-04 |
218 | Efficiently Train ASR Models That Memorize Less and Perform Better with Per-core Clipping Highlight: This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clipping (PCC), across training a wide range of ASR models. |
LUN WANG et. al. | arxiv-cs.CR | 2024-06-04 |
219 | Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition Via Weakly Phonetic Supervision Highlight: This paper explores the approach of pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. |
Saierdaer Yusuyin; Te Ma; Hao Huang; Wenbo Zhao; Zhijian Ou; | arxiv-cs.SD | 2024-06-04 |
220 | Speaking of Accent: A Content Analysis of Accent Misconceptions in ASR Research Abstract: Automatic speech recognition (ASR) researchers are working to address the differing transcription performance of ASR by accent or dialect. However, research often has a limited … |
Kerri Prinos; Neal Patwari; Cathleen A. Power; | Proceedings of the 2024 ACM Conference on Fairness, … | 2024-06-03 |
221 | Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach Highlight: This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours of audio. |
Ara Yeroyan; Nikolay Karpov; | arxiv-cs.CL | 2024-06-03 |
222 | Pass The Butter: A Study on Desktop-classic Multitasking Robotic Arm Based on Advanced YOLOv7 and BERT Highlight: In order to meet the current societal demand for service robot technology, this study proposes using a miniaturized desktop-level robot (by ROS) as a carrier, locally deploying a natural language model (NLP-BERT), and integrating visual recognition (CV-YOLO) and speech recognition technology (ASR-Whisper) as inputs to achieve autonomous decision-making and rational action by the desktop robot. |
HAOHUA QUE et. al. | arxiv-cs.RO | 2024-05-27 |
223 | Denoising LM: Pushing The Limits of Error Correction Models for Speech Recognition Highlight: In this paper, we present Denoising LM (DLM), a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. |
ZIJIN GU et. al. | arxiv-cs.LG | 2024-05-24 |
224 | Let’s Fuse Step By Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition Highlight: We introduce Generative Fusion Decoding (GFD), a novel shallow fusion framework, utilized to integrate Large Language Models (LLMs) into multi-modal text recognition systems such as automatic speech recognition (ASR) and optical character recognition (OCR). |
CHAN-JAN HSU et. al. | arxiv-cs.CL | 2024-05-23 |
225 | You Don’t Understand Me!: Comparing ASR Results for L1 and L2 Speakers of Swedish Highlight: In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. |
Ronald Cumbal; Birger Moell; Jose Lopes; Olof Engwall; | arxiv-cs.CL | 2024-05-22 |
226 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction Highlight: This prevents human users from interrupting the robot, which limits speech-based human-robot interaction. To enable a more natural interaction that allows for such interruptions, we propose an audio processing pipeline for filtering out the robot’s ego speech using only a single-channel microphone. |
Yue Li; Florian A. Kunneman; Koen V. Hindriks; | arxiv-cs.HC | 2024-05-22 |
227 | Linguistic Analysis of Human-computer Interaction Abstract: This article reviews recent literature investigating speech variation in production and comprehension during spoken language communication between humans and devices. Human speech … |
Georgia Zellou; Nicole Holliday; | Frontiers Comput. Sci. | 2024-05-21 |
228 | Non-autoregressive Real-time Accent Conversion Model with Voice Cloning Highlight: We have developed a non-autoregressive model for real-time accent conversion with voice cloning. |
Vladimir Nechaev; Sergey Kosyakov; | arxiv-cs.SD | 2024-05-21 |
229 | A Study on Speech Recognition By A Neural Network Based on English Speech Feature Parameters Abstract: In this study, from the perspective of English speech feature parameters, two feature parameters, the mel-frequency cepstral coefficient (MFCC) and filter bank (Fbank), were … |
Congmin Mao; Sujing Liu; | J. Adv. Comput. Intell. Intell. Informatics | 2024-05-20 |
230 | Listen Again and Choose The Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models Highlight: In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. |
YUCHEN HU et. al. | arxiv-cs.CL | 2024-05-16 |
231 | Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings Abstract: Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. … |
Ahmed Adel Attia; Dorottya Demszky; Tolúlopé Ògúnrèmí; Jing Liu; Carol Y. Espy-Wilson; | ArXiv | 2024-05-15 |
232 | Towards Evaluating The Robustness of Automatic Speech Recognition Systems Via Audio Style Transfer Highlight: In this paper, we propose an attack on ASR systems based on user-customized style transfer. |
WEIFEI JIN et. al. | arxiv-cs.SD | 2024-05-15 |
233 | I Know What You Mean: Context-Aware Recognition to Enhance Speech-Based Games Abstract: Recent advances in language processing and speech recognition open up a large opportunity for video game companies to embrace voice interaction as an intuitive feature and … |
Nima Zargham; Mohamed Lamine Fetni; Laura Spillner; Thomas Muender; Rainer Malaka; | Proceedings of the CHI Conference on Human Factors in … | 2024-05-11 |
234 | Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models Highlight: We propose a simple yet effective method to learn a universal acoustic realization of Whisper’s <|endoftext|> token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively ‘muting’ the model. |
Vyas Raina; Rao Ma; Charles McGhee; Kate Knill; Mark Gales; | arxiv-cs.CL | 2024-05-09 |
235 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | arxiv-cs.CL | 2024-05-09 |
236 | The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge Highlight: This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. |
JINGGUANG TIAN et. al. | arxiv-cs.SD | 2024-05-08 |
237 | Open Implementation and Study of BEST-RQ for Speech Processing Highlight: In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. |
Ryan Whetten; Titouan Parcollet; Marco Dinarelli; Yannick Estève; | arxiv-cs.CL | 2024-05-07 |
238 | Mixat: A Data Set of Bilingual Emirati-English Speech Highlight: This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. |
Maryam Al Ali; Hanan Aldarmaki; | arxiv-cs.CL | 2024-05-04 |
239 | Unveiling The Potential of LLM-Based ASR on Chinese Open-Source Datasets Highlight: Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. |
XUELONG GENG et. al. | arxiv-cs.SD | 2024-05-03 |
240 | Integrated End-to-End Automatic Speech Recognition for Languages for Agglutinative Languages Abstract: The relevance of the problem of automatic speech recognition lies in the lack of research for low-resource languages, stemming from limited training data and the necessity for new … |
A. Bekarystankyzy; O. Mamyrbayev; Tolganay Anarbekova; | ACM Transactions on Asian and Low-Resource Language … | 2024-05-03 |
241 | Towards Fair and Inclusive Speech Recognition for Stuttering: Community-led Chinese Stuttered Speech Dataset Creation and Benchmarking Abstract: Despite the widespread adoption of Automatic Speech Recognition (ASR) models in voice-operated products and conversational AI agents, current ASR models perform poorly for people … |
Qisheng Li; Shaomei Wu; | Extended Abstracts of the CHI Conference on Human Factors … | 2024-05-02 |
242 | Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features Highlight: This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. |
FRANCISCO TEIXEIRA et. al. | arxiv-cs.LG | 2024-05-02 |
243 | Low-resource Speech Recognition and Dialect Identification of Irish in A Multi-task Framework Highlight: This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). |
Liam Lonergan; Mengjie Qian; Neasa Ní Chiaráin; Christer Gobl; Ailbhe Ní Chasaide; | arxiv-cs.CL | 2024-05-02 |
244 | Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition Highlight: Moreover, current methods require much time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called After, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. |
Dongyuan Li; Ying Zhang; Yusong Wang; Funakoshi Kataro; Manabu Okumura; | arxiv-cs.SD | 2024-05-01 |
245 | Efficient Compression of Multitask Multilingual Speech Models Highlight: Whisper yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. |
Thomas Palmeira Ferraz; | arxiv-cs.CL | 2024-05-01 |
246 | Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration Highlight: Confidence scores of automatic speech recognition (ASR) outputs are often inadequately communicated, preventing their seamless integration into analytical workflows. In this paper, we introduce ConFides, a visual analytic system developed in collaboration with intelligence analysts to address this issue. |
Sunwoo Ha; Chaehun Lim; R. Jordan Crouser; Alvitta Ottley; | arxiv-cs.HC | 2024-04-30 |
247 | Toward Robust ASR System Against Audio Adversarial Examples Using Agitated Logit Abstract: Automatic speech recognition (ASR) systems are vulnerable to audio adversarial examples, which aim at deceiving ASR systems by adding perturbations to benign speech signals. These … |
N. Park; Jong Kim; | ACM Transactions on Privacy and Security | 2024-04-26 |
248 | Child Speech Recognition in Human-Robot Interaction: Problem Solved? Highlight: We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. |
RUBEN JANSSENS et. al. | arxiv-cs.CL | 2024-04-26 |
249 | Automatic Speech Recognition System-Independent Word Error Rate Estimation Highlight: In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. |
Chanho Park; Mingjie Chen; Thomas Hain; | arxiv-cs.CL | 2024-04-25 |
250 | Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information Highlight: This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. |
Chihiro Taguchi; Jefferson Saransig; Dayana Velásquez; David Chiang; | arxiv-cs.CL | 2024-04-23 |
251 | Enhancing ASR Performance Through Relative Word Frequency in OCR and Normal Word Frequency Analysis Abstract: With the growing interest in Conversational AI, a system that enables machines to engage in human-like dialogues, there has been an increased focus on Automatic Speech Recognition … |
KYUDAN JUNG et. al. | 2024 IEEE 6th International Conference on AI Circuits and … | 2024-04-22 |
252 | Semantically Corrected Amharic Automatic Speech Recognition Highlight: In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. |
Samuael Adnew; Paul Pu Liang; | arxiv-cs.CL | 2024-04-20 |
253 | Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation Highlight: This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. |
Ye Bai; Chenxing Li; Hao Li; Yuanyuan Zhao; Xiaorui Wang; | arxiv-cs.SD | 2024-04-17 |
254 | Synthetic Conversations Improve Multi-Talker ASR Abstract: In recent times, automatic speech recognition (ASR) has seen remarkable progress, particularly in recognizing dominant speakers. Nevertheless, the challenge of multi-talker … |
Thai-Binh Nguyen; Alexander Waibel; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
255 | Task Vector Algebra for ASR Models Abstract: Vector representations of text and speech signals, such as word2vec and wav2vec, are commonly used in automatic speech recognition (ASR) and spoken language understanding systems. … |
Gowtham Ramesh; Kartik Audhkhasi; B. Ramabhadran; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
256 | Large Language Models As A Proxy For Human Evaluation In Assessing The Comprehensibility Of Disordered Speech Transcription Abstract: Automatic Speech Recognition (ASR) systems, despite significant advances in recent years, still have much room for improvement, particularly in the recognition of disordered … |
KATRIN TOMANEK et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
257 | A Study on The Adverse Impact of Synthetic Speech on Speech Recognition Abstract: High-quality synthetic speech produced by TTS has been widely used in the field of human-computer interaction, bringing users a better experience. However, synthetic speech is prone to … |
Jian Huang; Yancheng Bai; Yang Cai; Wei Bian; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
258 | Automatic Speech Recognition Tuned for Child Speech in The Classroom Abstract: K-12 school classrooms have proven to be a challenging environment for Automatic Speech Recognition (ASR) systems, both due to background noise and conversation, and differences … |
ROSY SOUTHWELL et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
259 | Extending Large Language Models for Speech and Audio Captioning Abstract: Multimodal large language models (LLMs) have shown promising visual perception abilities by connecting with image encoders, but their performance on auditory tasks has not yet … |
CHANGLI TANG et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
260 | The Fosafer System for The ICASSP2024 In-Car Multi-Channel Automatic Speech Recognition Challenge Abstract: This paper presents Fosafer’s submissions to the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge (ICMC-ASR), which includes both the Automatic Speech … |
Shangkun Huang; Yuxuan Du; Yankai Wang; Jing Deng; Rong Zheng; | 2024 IEEE International Conference on Acoustics, Speech, … | 2024-04-14 |
261 | Exploring Adapters with Conformers for Children’s Automatic Speech Recognition Abstract: The high variability in acoustic, pronunciation, and linguistic characteristics of children’s speech makes children’s automatic speech recognition (ASR) a complex task. … |
Thomas Rolland; Alberto Abad; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
262 | Improved Children’s Automatic Speech Recognition Combining Adapters and Synthetic Data Augmentation Abstract: Children’s automatic speech recognition (ASR) poses a significant challenge due to the highly variable nature of children’s speech. The limited availability of training datasets … |
Thomas Rolland; Alberto Abad; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
263 | Multitask Speech Recognition and Speaker Change Detection for Unknown Number of Speakers Abstract: Traditionally, automatic speech recognition (ASR) and speaker change detection (SCD) systems have been independently trained to generate comprehensive transcripts accompanied by … |
SHASHI KUMAR et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
264 | SIR-Progressive Audio-Visual TF-Gridnet with ASR-Aware Selector for Target Speaker Extraction in MISP 2023 Challenge Abstract: TF-GridNet has demonstrated its effectiveness in speech separation and enhancement. In this paper, we extend its capabilities for progressive audio-visual speech enhancement by … |
ZHONGSHU HOU et. al. | 2024 IEEE International Conference on Acoustics, Speech, … | 2024-04-14 |
265 | Enhancing Two-Stage Finetuning for Speech Emotion Recognition Using Adapters Abstract: This study investigates the effective finetuning of a pretrained model using adapters for speech emotion recognition (SER). Since emotion is related to linguistic and prosodic … |
Yuan Gao; Hao Shi; Chenhui Chu; Tatsuya Kawahara; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
266 | Train Long and Test Long: Leveraging Full Document Contexts in Speech Processing Abstract: The quadratic memory complexity of self-attention has generally restricted Transformer-based models to utterance-based speech processing, preventing models from leveraging … |
William Chen; Takatomo Kano; A. Ogawa; Marc Delcroix; Shinji Watanabe; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
267 | Improving Multi-Speaker ASR With Overlap-Aware Encoding And Monotonic Attention Abstract: End-to-end (E2E) multi-speaker speech recognition with the serialized output training (SOT) strategy demonstrates good performance in modeling diverse speaker scenarios. However, … |
TAO LI et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
268 | Hot-Fixing Wake Word Recognition for End-to-End ASR Via Neural Model Reprogramming Abstract: This paper proposes two novel variants of neural reprogramming to enhance wake word recognition in streaming end-to-end ASR models without updating model weights. The first, … |
PIN-JUI KU et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
269 | Automatic Speech Recognition Advancements for Indigenous Languages of The Americas Highlight: In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. |
Monica Romero; Sandra Gomez; Ivan G. Torre; | arxiv-cs.CL | 2024-04-12 |
270 | An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution Highlight: However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels, and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. |
Tien-Hong Lo; Fu-An Chao; Tzu-I Wu; Yao-Ting Sung; Berlin Chen; | arxiv-cs.SD | 2024-04-11 |
271 | VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in The Medical Domain Highlight: In this work, we present VietMed – a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. |
Khai Le-Duc; | arxiv-cs.CL | 2024-04-08 |
272 | Mai Ho’omāuna I Ka ‘Ai: Language Models Improve Automatic Speech Recognition in Hawaiian Highlight: In this paper we address the challenge of improving Automatic Speech Recognition (ASR) for a low-resource language, Hawaiian, by incorporating large amounts of independent text data into an ASR foundation model, Whisper. |
Kaavya Chaparala; Guido Zarrella; Bruce Torres Fischer; Larry Kimura; Oiwi Parker Jones; | arxiv-cs.CL | 2024-04-03 |
273 | Noise Masking Attacks and Defenses for Pretrained Speech Models Highlight: They show that when a record has been seen at training time, the model will transcribe the noisy record with its memorized sensitive transcript. In our work, we extend these attacks beyond ASR models, to attack pretrained speech encoders. |
Matthew Jagielski; Om Thakkar; Lun Wang; | arxiv-cs.LG | 2024-04-02 |
274 | BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. |
Alexandros Haliassos; Andreas Zinonos; Rodrigo Mira; Stavros Petridis; Maja Pantic; | arxiv-cs.CV | 2024-04-02 |
275 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Emotion Neural Transducer for fine-grained speech emotion recognition with automatic speech recognition (ASR) joint training. |
Siyuan Shen; Yu Gao; Feng Liu; Hanyang Wang; Aimin Zhou; | arxiv-cs.SD | 2024-03-28 |
276 | Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. |
YASH JAIN et. al. | arxiv-cs.CL | 2024-03-28 |
277 | DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we proposed a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information to facilitate mitigation of phonetic confusion for NEC on ASR transcription. |
Yi-Cheng Wang; Hsin-Wei Wang; Bi-Cheng Yan; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-03-26 |
278 | More Than Words: Advancements and Challenges in Speech Recognition for Singing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. |
Anna Kruspe; | arxiv-cs.SD | 2024-03-14 |
279 | A Review on Gujarati Language Based Automatic Speech Recognition (ASR) Systems Related Papers Related Patents Related Grants Related Venues Related Experts View |
Mohit Dua; Bhavesh Bhagat; Shelza Dua; N. Chakravarty; | Int. J. Speech Technol. | 2024-03-12 |
280 | Automatic Speech Recognition (ASR) for The Diagnosis of Pronunciation of Speech Sound Disorders in Korean Children Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. |
TAEKYUNG AHN et. al. | arxiv-cs.CL | 2024-03-12 |
281 | SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. |
Jiayu Du; Jinpeng Li; Guoguo Chen; Wei-Qiang Zhang; | arxiv-cs.CL | 2024-03-12 |
282 | Dataset and Evaluation of Automatic Speech Recognition for Multi-lingual Intent Recognition on Social Robots Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While Automatic Speech Recognition (ASR) systems excel in controlled environments, challenges arise in robot-specific setups due to unique microphone requirements and added noise … |
Antonio Andriella; Raquel Ros; Yoav Ellinson; Sharon Gannot; S. Lemaignan; | 2024 19th ACM/IEEE International Conference on Human-Robot … | 2024-03-11 |
283 | SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-03-10 |
284 | Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain enhancement model and a CrossNet time-frequency domain enhancement model. |
Yufeng Yang; Ashutosh Pandey; DeLiang Wang; | arxiv-cs.SD | 2024-03-10 |
285 | A New Benchmark for Evaluating Automatic Speech Recognition in The Arabic Call Domain Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our work aims to establish a robust benchmark that not only encompasses the broad spectrum of Arabic dialects but also emulates the real-world conditions of call-based communications. |
QUSAI ABO OBAIDAH et. al. | arxiv-cs.AI | 2024-03-07 |
286 | Kirigami Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Audio-based human activity recognition (HAR) is very popular because many human activities have unique sound signatures that can be detected using machine learning (ML) … |
Sudershan Boovaraghavan; Haozhe Zhou; Mayank Goel; Yuvraj Agarwal; | Proceedings of the ACM on Interactive, Mobile, Wearable and … | 2024-03-06 |
287 | JEP-KD: Joint-Embedding Predictive Architecture Based Knowledge Distillation for Visual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Visual Speech Recognition (VSR) tasks are generally recognized to have a lower theoretical performance ceiling than Automatic Speech Recognition (ASR), owing to the inherent … |
Chang Sun; Hong Yang; Bo Qin; | ArXiv | 2024-03-04 |
288 | PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) … |
Joonas Kalda; Clément Pagés; R. Marxer; Tanel Alumäe; Hervé Bredin; | The Speaker and Language Recognition Workshop | 2024-03-04 |
289 | Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This survey offers a comprehensive review of DTL, FL, and RL-based ASR frameworks, aiming to provide insights into the latest developments and aid researchers and professionals in understanding the current challenges. |
Hamza Kheddar; Mustapha Hemis; Yassine Himeur; | arxiv-cs.SD | 2024-03-02 |
290 | A Cross-Modal Approach to Silent Speech with LLM-Enhanced Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Silent Speech Interfaces (SSIs) offer a noninvasive alternative to brain-computer interfaces for soundless verbal communication. We introduce Multimodal Orofacial Neural Audio … |
Tyler Benster; G. Wilson; Reshef Elisha; Francis R. Willett; S. Druckmann; | ArXiv | 2024-03-02 |
291 | Towards Inclusive Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View |
Siyuan Feng; B. Halpern; O. Kudina; O. Scharenborg; | Comput. Speech Lang. | 2024-03-01 |
292 | Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions. |
Heyang Liu; Yu Wang; Yanfeng Wang; | arxiv-cs.CL | 2024-03-01 |
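Entry 292's post-decoder biasing constructs a transform probability matrix from the distribution of training transcriptions. A much-simplified unigram version of the same intuition — not the paper's actual construction; the vocabulary, interpolation weight, and example words are invented — might look like:

```python
from collections import Counter

def domain_prior(train_transcripts, vocab, floor=1e-6):
    """Unigram distribution over `vocab` estimated from in-domain transcriptions."""
    counts = Counter(tok for sent in train_transcripts for tok in sent.split())
    total = sum(counts.values())
    return {w: (counts[w] + floor) / (total + floor * len(vocab)) for w in vocab}

def biased_posterior(posterior, prior, alpha=0.3):
    """Interpolate the decoder posterior with the domain prior, then renormalize."""
    mixed = {w: (1 - alpha) * posterior[w] + alpha * prior[w] for w in posterior}
    z = sum(mixed.values())
    return {w: p / z for w, p in mixed.items()}

# Hypothetical multi-turn medical-interview transcriptions.
transcripts = ["patient reports fever", "fever and cough", "no fever today"]
prior = domain_prior(transcripts, vocab={"fever", "favor"})
# Acoustically, the decoder slightly prefers the out-of-domain homophone.
posterior = {"fever": 0.45, "favor": 0.55}
rescored = biased_posterior(posterior, prior, alpha=0.3)
```

Because "fever" dominates the in-domain transcriptions, interpolation flips the ranking toward the domain term while keeping a proper probability distribution.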
293 | Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. |
Jeehyun Lee; Yerin Choi; Tae-Jin Song; Myoung-Wan Koo; | arxiv-cs.CL | 2024-02-29 |
294 | Probing The Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Following much research on neural network interpretability, we propose in this article a protocol that aims to determine what information is encoded in an ASR acoustic model (AM), and where. |
Quentin Raymondaud; Mickael Rouvier; Richard Dufour; | arxiv-cs.SD | 2024-02-29 |
295 | Exploration of Adapter for Noise Robust Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study thoroughly investigates adapter-based ASR adaptation in noisy environments. |
Hao Shi; Tatsuya Kawahara; | arxiv-cs.SD | 2024-02-28 |
296 | A Multitask Co-training Framework for Improving Speech Translation By Leveraging Speech Recognition and Machine Translation Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View |
Yue Zhou; Yuxuan Yuan; Xiaodong Shi; | Neural Comput. Appl. | 2024-02-27 |
297 | Large Language Models Are Efficient Learners of Noise-Robust Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we extend the benchmark to noisy conditions and investigate if we can teach LLMs to perform denoising for GER just like what a robust ASR system does, where one solution is introducing noise information as a conditioner into the LLM. The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. |
YUCHEN HU et. al. | iclr | 2024-02-26 |
298 | An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to tackle the challenge caused by the code-switching phenomenon. |
Tzu-Ting Yang; Hsin-Wei Wang; Yi-Cheng Wang; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-02-26 |
299 | It’s Never Too Late: Fusing Acoustic Information Into Large Language Models for Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, despite its effectiveness, GER introduces extra data uncertainty since the LLM is trained without taking into account acoustic information available in the speech signal. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). |
CHEN CHEN et. al. | iclr | 2024-02-26 |
300 | LipVoicer: Generating Speech from Silent Videos Guided By Lip Reading Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. |
Yochai Yemini; Aviv Shamsian; Lior Bracha; Sharon Gannot; Ethan Fetaya; | iclr | 2024-02-26 |
301 | Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study delves into how weight parameters in speech recognition models influence the overall power consumption of these models. We discovered that the impact of weight parameters on power consumption varies, influenced by factors including how often they are invoked and their placement in memory. |
YANG LI et. al. | arxiv-cs.SD | 2024-02-20 |
302 | Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose the multichannel multi-modal speech self-supervised learning framework AV-wav2vec2, which utilizes video and multichannel audio data as inputs. |
Qiushi Zhu; Jie Zhang; Yu Gu; Yuchen Hu; Lirong Dai; | aaai | 2024-02-20 |
303 | OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). |
Yifan Peng; Yui Sudo; Muhammad Shakeel; Shinji Watanabe; | arxiv-cs.CL | 2024-02-19 |
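CTC, the objective behind OWSM-CTC in entry 303, admits a very simple greedy (best-path) decoding rule: collapse repeated frame labels, then drop blanks. A sketch, with a toy vocabulary and frame sequence invented for illustration:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Best-path CTC decoding: collapse repeated labels, then remove blanks."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

vocab = ["-", "h", "e", "l", "o"]  # index 0 is the CTC blank
frames = [0, 1, 1, 0, 2, 2, 3, 0, 3, 4, 4, 0]  # per-frame argmax label ids
decoded = "".join(vocab[i] for i in ctc_greedy_decode(frames))
# The blank between the two 3s keeps the double "l" from collapsing: "hello"
```

Note how the blank symbol is what allows genuinely repeated characters to survive the collapse step.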
304 | Phantom in The Opera: Adversarial Music Attack for Robot Dialogue System Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This study explores the vulnerability of robot dialogue systems’ automatic speech recognition (ASR) module to adversarial music attacks. Specifically, we explore music as a … |
Sheng Li; Jiyi Li; Yang Cao; | Frontiers Comput. Sci. | 2024-02-15 |
305 | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). |
ZIYANG MA et. al. | arxiv-cs.CL | 2024-02-13 |
306 | The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This research represents a pioneering effort in quantifying biases in the Portuguese language context through the application of MMS and Whisper, contributing to a better understanding of ASR systems’ performance in multilingual settings. |
Ajinkya Kulkarni; Anna Tokareva; Rameez Qureshi; Miguel Couceiro; | arxiv-cs.CL | 2024-02-12 |
307 | Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. |
HEESEUNG KIM et. al. | arxiv-cs.CL | 2024-02-08 |
308 | A Comprehensive Study of The Current State-of-the-Art in Nepali Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we examine the research conducted in the field of Nepali Automatic Speech Recognition (ASR). |
Rupak Raj Ghimire; Bal Krishna Bal; Prakash Poudyal; | arxiv-cs.SD | 2024-02-05 |
309 | Digits Micro-model for Accurate and Secure Transactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present our work on creating micro models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. |
Chirag Chhablani; Nikhita Sharma; Jordan Hosier; Vijay K. Gurbani; | arxiv-cs.LG | 2024-02-02 |
310 | Streaming Sequence Transduction Through Dynamic Compression Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. |
WEITING TAN et. al. | arxiv-cs.CL | 2024-02-02 |
311 | AccentFold: A Journey Through African Accents for Zero-Shot ASR Adaptation to Target Accents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). |
Abraham Toluwase Owodunni; Aditya Yadavalli; Chris Chinenye Emezue; Tobi Olatunji; Clinton C Mbataku; | arxiv-cs.CL | 2024-02-02 |
312 | Chinese Dialect Speech Recognition: A Comprehensive Survey Related Papers Related Patents Related Grants Related Venues Related Experts View |
Qiang Li; Qianyu Mai; Mandou Wang; Mingjuan Ma; | Artif. Intell. Rev. | 2024-01-31 |
313 | Exploring The Limits of Decoder-only Models Trained on Public Speech Recognition Corpora Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate factors such as choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. |
Ankit Gupta; George Saon; Brian Kingsbury; | arxiv-cs.CL | 2024-01-31 |
314 | Improving ASR Performance with OCR Through Using Word Frequency Difference IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recently, there has been a growing interest in conversational artificial intelligence (AI). As a result, research is actively being conducted on automatic speech recognition (ASR) … |
Kyudan Jung; Seungmin Bae; N. Kim; Hyun Gon Ryu; Hyuk-Jae Lee; | 2024 International Conference on Electronics, Information, … | 2024-01-28 |
315 | Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent research highlights the dependency of BPE subword tokenization’s efficacy on the morphological nature of the language, particularly in languages rich in inflectional morphology, where fewer BPE merges suffice for generating highly productive tokens. Motivated by this, our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity, thus enhancing out-of-distribution automatic speech recognition (ASR) performance. |
Ahnaf Mozib Samin; | arxiv-cs.CL | 2024-01-27 |
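The BPE procedure whose merge count entry 315 tunes works by repeatedly merging the most frequent adjacent symbol pair. A self-contained sketch on toy word frequencies (real systems use libraries such as SentencePiece on far larger corpora):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules by repeatedly merging the most frequent symbol pair."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, f in vocab.items():
            for pair in zip(syms, syms[1:]):
                pairs[pair] += f
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for syms, f in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = f
        vocab = new_vocab
    return merges, vocab

merges, segmented = learn_bpe({"lower": 5, "lowest": 3, "low": 7}, num_merges=2)
# Two merges already yield the productive subword "low" shared by all three words.
```

This is exactly the effect the Bengali study exploits: in morphologically rich languages, a small number of merges can already produce highly productive stems.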
316 | Toward Practical Automatic Speech Recognition and Post-Processing: A Call for Explainable Error Benchmark Guideline Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. |
SEONMIN KOO et. al. | arxiv-cs.CL | 2024-01-25 |
317 | SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. |
CHYI-JIUNN LIN et. al. | arxiv-cs.CL | 2024-01-24 |
318 | MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker’s emotion, with the text … |
Jiajun He; Xiaohan Shi; Xingfeng Li; Tomoki Toda; | arxiv-cs.CL | 2024-01-24 |
319 | Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. |
W. RONNY HUANG et. al. | arxiv-cs.CL | 2024-01-23 |
320 | Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. |
Michael Hentschel; Yuta Nishikawa; Tatsuya Komatsu; Yusuke Fujita; | arxiv-cs.CL | 2024-01-22 |
321 | Using Large Language Model for End-to-End Chinese ASR and NER Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This approach, however, has received less attention in the literature. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. |
YUANG LI et. al. | arxiv-cs.CL | 2024-01-20 |
322 | SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. |
Hao Wang; Shuhei Kurita; Shuichiro Shimizu; Daisuke Kawahara; | arxiv-cs.CV | 2024-01-18 |
323 | Joint Unsupervised and Supervised Training for Automatic Speech Recognition Via Bilevel Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present a novel bilevel optimization-based approach to training acoustic models for automatic speech recognition (ASR) tasks that we term bi-level joint unsupervised and supervised training (BL-JUST). |
A F M SAIF et. al. | arxiv-cs.CL | 2024-01-13 |
324 | LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In contrast to rare phrase lists, the slides within videos are synchronized in real-time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to leverage the long-context information available in videos effectively. |
Fan Yu; Haoxu Wang; Xian Shi; Shiliang Zhang; | arxiv-cs.SD | 2024-01-12 |
325 | XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This research paper focuses on the development and evaluation of Automatic Speech Recognition (ASR) technology using the XLS-R 300m model. The study aims to improve ASR … |
Panji Arisaputra; Alif Tri Handoyo; Amalia Zahra; | ArXiv | 2024-01-12 |
326 | UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction. |
JIAXIN GUO et. al. | arxiv-cs.CL | 2024-01-11 |
327 | Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: \textbf{Objectives}: We aimed to investigate how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy, specifically in the “Cookie Theft” picture description task. |
Changye Li; Weizhe Xu; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-01-10 |
328 | High-precision Voice Search Query Correction Via Retrievable Speech-text Embedings Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, ASR-hypothesis-based retrieval can yield poor precision if the textual hypotheses are too phonetically dissimilar to the transcript truth. In this paper, we eliminate the hypothesis-audio mismatch problem by querying the correction database directly using embeddings derived from the utterance audio; the embeddings of the utterance audio and candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of the audio of an utterance and the embedding of its corresponding textual transcript close together. |
CHRISTOPHER LI et. al. | arxiv-cs.CL | 2024-01-08 |
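Entry 328's key move is retrieving corrections by embedding similarity rather than by textual ASR hypotheses. Assuming audio and candidate corrections already live in a shared embedding space (the 3-d vectors below are hypothetical stand-ins for real multimodal network outputs), the lookup itself is a thresholded nearest-neighbor search:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_correction(audio_emb, correction_db, threshold=0.8):
    """Return the stored correction closest to the utterance audio, if close enough."""
    best_text, best_sim = None, threshold
    for text, emb in correction_db.items():
        sim = cosine(audio_emb, emb)
        if sim > best_sim:
            best_text, best_sim = text, sim
    return best_text

# Hypothetical correction database keyed by corrected transcript text.
db = {"play beyonce": [0.9, 0.1, 0.0], "call brian": [0.1, 0.9, 0.1]}
hit = retrieve_correction([0.95, 0.08, 0.0], db)   # near the first entry
miss = retrieve_correction([0.0, 0.0, 1.0], db)    # far from everything
```

The threshold is what gives the "high-precision" behavior: an utterance that resembles nothing in the database returns no correction rather than a bad one.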
329 | A New MmWave-Speech Multimodal Speech System for Voice User Interface Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Voice user interface (VUI) plays an essential role in intelligent scenes, e.g., smart homes. It provides a hands- and eyes-free human-machine interaction between humans and … |
Tiantian Liu; Feng Lin; | GetMobile: Mobile Computing and Communications | 2024-01-08 |
330 | An Audio-quality-based Multi-strategy Approach for Target Speaker Extraction in The MISP 2023 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. |
RUNDUO HAN et. al. | arxiv-cs.SD | 2024-01-08 |
331 | Cross-Speaker Encoding Network for Multi-Talker Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. |
JIAWEN KANG et. al. | arxiv-cs.SD | 2024-01-08 |
332 | ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. |
HE WANG et. al. | arxiv-cs.SD | 2024-01-07 |
333 | MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. |
He Wang; Pengcheng Guo; Pan Zhou; Lei Xie; | arxiv-cs.SD | 2024-01-07 |
334 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we introduce a method that utilizes the ASR system’s lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. |
KEVIN EVERSON et. al. | arxiv-cs.CL | 2024-01-05 |
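A word confusion network like the one entry 334 feeds to the LLM is a sequence of slots, each holding competing words with posterior probabilities; keeping the alternatives rather than only the 1-best string is what preserves speech ambiguity. A toy sketch with invented words and scores:

```python
from itertools import product

def best_path(wcn):
    """1-best string: take the highest-posterior arc in every slot."""
    return " ".join(max(slot, key=lambda arc: arc[1])[0] for slot in wcn)

def top_paths(wcn, k=3):
    """Rank full paths by joint probability (exhaustive; fine for short WCNs)."""
    paths = []
    for arcs in product(*wcn):
        p = 1.0
        for _, arc_p in arcs:
            p *= arc_p
        paths.append((" ".join(w for w, _ in arcs), p))
    paths.sort(key=lambda x: -x[1])
    return paths[:k]

# Invented slots: each holds competing ASR words with posterior probabilities.
wcn = [
    [("flights", 0.6), ("flight", 0.4)],
    [("to", 0.9), ("two", 0.1)],
    [("boston", 0.7), ("austin", 0.3)],
]
```

A downstream SLU model shown the full `wcn` can recover from a wrong 1-best word, since plausible alternatives like "flight" and "austin" remain visible with their scores.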
335 | An Approach for Speech Enhancement in Low SNR Environments Using Granular Speaker Embedding Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The proliferation of speech technology applications has led to an unprecedented demand for effective speech enhancement techniques, particularly in low Signal-to-Noise Ratio (SNR) … |
Jayasree Saha; Rudrabha Mukhopadhyay; A. Agrawal; Surabhi Jain; C. V. Jawahar; | Proceedings of the 7th Joint International Conference on … | 2024-01-04 |
336 | Research on The Application of Speech Database Based on Emotional Feature Extraction in International Chinese Education and Teaching Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The advanced analysis of the relationship between acoustic and emotional characteristics of speech signals can effectively improve the interactivity and intelligence of computers. … |
Xiangli Zhang; | Scalable Comput. Pract. Exp. | 2024-01-04 |
337 | Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an automatic speech recognition (ASR) model to hallucination at test time, which does not require access to the training dataset. |
Rita Frieske; Bertram E. Shi; | arxiv-cs.CL | 2024-01-03 |
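Entry 337's observation — that word error rate alone cannot flag hallucination — is easy to reproduce: a noisy-but-faithful hypothesis and a fluent fabrication can score identically. A standard Levenshtein-based WER, with invented example sentences:

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words, divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[len(r)][len(h)] / len(r)

ref = "the cat sat on the mat"
noisy = "the cat sag on the mad"          # faithful but acoustically garbled
hallucinated = "the dog sat on the sofa"  # fluent, confidently wrong content
# Both hypotheses score WER = 2/6, so WER alone cannot separate them.
```

This is why the paper resorts to a perturbation-based susceptibility test instead of relying on aggregate error metrics.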
338 | Chinese Spoken Named Entity Recognition in Real-world Scenarios: Dataset and Approaches Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Spoken Named Entity Recognition (NER) aims to extract entities from speech. The extracted entities can help voice assistants better understand user’s questions and … |
SHILIN ZHOU et. al. | Annual Meeting of the Association for Computational … | 2024-01-01 |
339 | Arabic Speech Recognition: Advancement and Challenges Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech recognition is a captivating process that revolutionizes human-computer interactions, allowing us to interact and control machines through spoken commands. The foundation … |
ASHIFUR RAHMAN et. al. | IEEE Access | 2024-01-01 |
340 | Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Audio-Visual Speech Recognition (AVSR) is a promising approach to improving the accuracy and robustness of speech recognition systems with the assistance of visual cues in … |
Jiahong Li; Chenda Li; Yifei Wu; Yanmin Qian; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
341 | Pretraining and Adaptation Techniques for Electrolaryngeal Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We investigate state-of-the-art automatic speech recognition (ASR) systems and provide thorough investigations on training methods to adapt them to low-resourced electrolaryngeal … |
Lester Phillip Violeta; D. Ma; Wen-Chin Huang; T. Toda; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
342 | Exploring Native and Non-Native English Child Speech Recognition With Whisper Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Modern end-to-end Automatic Speech Recognition (ASR) systems struggle to recognise children’s speech. This challenge is due to the high acoustic variability in children’s voices … |
Rishabh Jain; Andrei Barcovschi; Mariam Yiwere; Peter Corcoran; H. Cucu; | IEEE Access | 2024-01-01 |
343 | Fine-Tuning ASR Models for Very Low-Resource Languages: A Study on Mvskoke Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recent advancements in multilingual models for automatic speech recognition (ASR) have been able to achieve a high accuracy for languages with extremely limited resources. This … |
Julia Mainzinger; Gina-Anne Levow; | Annual Meeting of the Association for Computational … | 2024-01-01 |
344 | ESAformer: Enhanced Self-Attention for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this letter, an Enhanced Self-Attention (ESA) module has been put forward for feature extraction. The proposed ESA is integrated with the recursive gated convolution and … |
Junhua Li; Zhikui Duan; Shiren Li; Xinmei Yu; Guangguang Yang; | IEEE Signal Processing Letters | 2024-01-01 |
345 | Explainability of Speech Recognition Transformers Via Gradient-Based Attention Visualization Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: In vision Transformers, attention visualization methods are used to generate heatmaps highlighting the class-corresponding areas in input images, which offers explanations on how … |
Tianli Sun; Haonan Chen; Guosheng Hu; Lianghua He; Cairong Zhao; | IEEE Transactions on Multimedia | 2024-01-01 |
346 | Waveform-Domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While waveform-domain speech enhancement (SE) has been extensively investigated in recent years and achieves state-of-the-art performance in many datasets, spectrogram-based SE … |
Hao Shi; M. Mimura; Tatsuya Kawahara; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
347 | Tuning Large Language Model for Speech Recognition With Mixed-Scale Re-Tokenization Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Large Language Models (LLMs) have proven successful across a spectrum of speech-related tasks, such as speech recognition, text-to-speech, and spoken language understanding. … |
Yukun Ma; Chong Zhang; Qian Chen; Wen Wang; Bin Ma; | IEEE Signal Processing Letters | 2024-01-01 |
348 | Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper addresses the training issues associated with neural network-based automatic speech recognition (ASR) under noise conditions. In particular, conventional joint training … |
Geon Woo Lee; Hong Kook Kim; Duk-Jo Kong; | IEEE Access | 2024-01-01 |
349 | NAO Vs. Pepper: Speech Recognition Performance Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View |
Akshara Pande; Deepti Mishra; Bhavana Nachenahalli Bhuthegowda; | Interacción | 2024-01-01 |
350 | Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. |
HUIMENG WANG et. al. | arxiv-cs.SD | 2023-12-31 |
351 | KEBAP: Korean Error Explainable Benchmark Dataset for ASR and Post-processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional evaluation metrics for ASR systems produce a singular aggregate score, which is insufficient for understanding specific system vulnerabilities. Therefore, we aim to address the limitations of the previous ASR evaluation methods by introducing the Korean Error Explainable Benchmark Dataset for ASR and Post-processing (KEBAP). |
SEONMIN KOO et. al. | emnlp | 2023-12-22 |
352 | Accented Speech Recognition With Accent-specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. |
Darshan Prabhu; Preethi Jyothi; Sriram Ganapathy; Vinit Unni; | emnlp | 2023-12-22 |
353 | Back Transcription As A Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. |
Marek Kubis; Paweł Skórzewski; Marcin Sowański; Tomasz Ziętkiewicz; | emnlp | 2023-12-22 |
354 | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). |
SRIJITH RADHAKRISHNAN et. al. | emnlp | 2023-12-22 |
355 | CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, the availability of datasets for this is limited. To address this issue, we present CS2W, a Chinese Spoken-to-Written style conversion dataset comprising 7,237 spoken sentences extracted from transcribed conversational texts. |
Zishan Guo; Linhao Yu; Minghui Xu; Renren Jin; Deyi Xiong; | emnlp | 2023-12-22 |
356 | Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we attempt to resolve structurally ambiguous utterances into unambiguous texts in Indonesian using prosodic information. |
RUHIYAH WIDIAPUTRI et. al. | emnlp | 2023-12-22 |
357 | CLAD-ST: Contrastive Learning with Adversarial Data for Robust Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address this robustness problem in downstream MT models by forcing the MT encoder to bring the representations of a noisy input closer to its clean version in the semantic space. This is achieved by introducing a contrastive learning method that leverages adversarial examples in the form of ASR outputs paired with their corresponding human transcripts to optimize the network parameters. |
Sathish Indurthi; Shamil Chollampatt; Ravi Agrawal; Marco Turchi; | emnlp | 2023-12-22 |
358 | Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an approach, that builds on a pre-trained ASR model and extends it with an adaptive upstream module, that fuses audio and visual information. |
Christopher Simic; Tobias Bocklet; | arxiv-cs.SD | 2023-12-21 |
359 | Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning. |
ANIRUDH S. SUNDAR et. al. | arxiv-cs.LG | 2023-12-21 |
360 | KNN-CTC: Enhancing ASR Via Retrieval of CTC Pseudo Labels Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The success of retrieval-augmented language models in various natural language processing (NLP) tasks has been constrained in automatic speech recognition (ASR) applications due to challenges in constructing fine-grained audio-text datastores. This paper presents kNN-CTC, a novel approach that overcomes these challenges by leveraging Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground truth alignments. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2023-12-20 |
361 | SpokesBiz — An Open Corpus of Conversational Polish Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We outline the general structure and content of the corpus, showcasing selected applications in linguistic research, evaluation and improvement of automatic speech recognition (ASR) systems. |
PIOTR PĘZIK et. al. | arxiv-cs.CL | 2023-12-19 |
362 | SpokesBiz – An Open Corpus of Conversational Polish Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper announces the early release of SpokesBiz, a freely available corpus of conversational Polish developed within the CLARIN-BIZ project and comprising over 650 hours of … |
PIOTR PĘZIK et. al. | ArXiv | 2023-12-19 |
363 | Arabic Speech Recognition Based on Self Supervised Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Arabic Speech Recognition (AASR) has gained significant attention in recent years due to its potential applications in various fields such as transcription, voice … |
Hiba Adreese Younis; Yusra Faisal Mohammad; | 2023 16th International Conference on Developments in … | 2023-12-18 |
364 | Speaker Mask Transformer for Multi-talker Overlapped Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multi-talker overlapped speech recognition remains a significant challenge, requiring not only speech recognition but also speaker diarization tasks to be addressed. In this … |
Peng Shen; Xugang Lu; Hisashi Kawai; | ArXiv | 2023-12-18 |
365 | Seq2seq for Automatic Paraphasia Detection in Aphasic Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel, sequence-to-sequence (seq2seq) model that is trained end-to-end (E2E) to perform both ASR and paraphasia detection tasks. |
MATTHEW PEREZ et. al. | arxiv-cs.SD | 2023-12-16 |
366 | Towards Robust Packet Loss Concealment System With ASR-Guided Representations Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Despite the significant advancements and promising performance of deep learning-based packet loss concealment (PLC) systems in transmission systems, their focus on modeling … |
Dali Yang; Joon-Hyuk Chang; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
367 | Ending The Blind Flight: Analyzing The Impact of Acoustic and Lexical Factors on WAV2VEC 2.0 in Air-Traffic Control Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Transformer neural networks have shown remarkable success on standard automatic speech recognition (ASR) benchmarks. However, they are known to be less robust against domain … |
Alexander Blatt; Badr M. Abdullah; D. Klakow; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
368 | Hierarchical Attention-Based Contextual Biasing For Personalized Speech Recognition Using Neural Transducers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Although end-to-end (E2E) automatic speech recognition (ASR) systems excel in general tasks, they frequently struggle with accurately recognizing personal rare words. Leveraging … |
Sibo Tong; Philip Harding; Simon Wiesler; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
369 | Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Transfer learning from large multilingual pretrained models, like XLSR, has become the new paradigm for Automatic Speech Recognition (ASR). Considering their ever-increasing size, … |
GEOFFROY VANDERREYDT et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
370 | Parameter-Efficient Cross-Language Transfer Learning for A Language-Modular Audiovisual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In audiovisual speech recognition (AV-ASR), for many languages only little audiovisual data is available. Building upon an English model, in this work, we first apply and analyze … |
ZHENGYANG LI et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
371 | Conformer-Based Speech Recognition On Extreme Edge-Computing Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a series of model architecture adaptions, neural network graph transformations, and numerical optimizations to fit an advanced Conformer based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. |
MINGBIN XU et. al. | arxiv-cs.LG | 2023-12-16 |
372 | Leveraging The Multilingual Indonesian Ethnic Languages Dataset In Self-Supervised Models for Low-Resource ASR Task Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Indonesia is home to roughly 700 languages, which amounts to about ten percent of the global total, positioning it as the second-most linguistically diverse country after Papua … |
S. Sakti; Benita Angela Titalim; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
373 | LiteVSR: Efficient Visual Speech Recognition By Learning from Speech Representations of Unlabeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. |
HENDRIK LAUX et. al. | arxiv-cs.CV | 2023-12-15 |
374 | Automatic Channel Selection and Spatial Feature Integration for Multi-channel Speech Recognition Across Various Array Topologies Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. |
BINGSHEN MU et. al. | arxiv-cs.SD | 2023-12-15 |
375 | Knowledge Prompt for Whisper: An ASR Entity Correction Approach with Knowledge Base Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Entity correction is crucial in Automatic Speech Recognition (ASR), since erroneous entities seriously affect our understanding of ASR results. In this paper, in order to … |
MIN ZHANG et. al. | 2023 IEEE International Conference on Big Data (BigData) | 2023-12-15 |
376 | Improvement of Automatic Speech Recognition Systems Utilizing 2D Adaptive Wavelet Transformation Applied to Recurrence Plot of Speech Trajectories Related Papers Related Patents Related Grants Related Venues Related Experts View |
S. Firooz; F. Almasganj; Yasser Shekofteh; | Signal, Image and Video Processing | 2023-12-15 |
377 | On The Compression of Shallow Non-causal ASR Models Using Knowledge Distillation and Tied-and-reduced Decoder for Low-latency On-device Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation, shared decoder, and tied-and-reduced transducer network in order to reduce the model footprint. |
NAGARAJ ADIGA et. al. | arxiv-cs.SD | 2023-12-15 |
378 | Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most past studies have simplified the learning complexity of the model by splitting the code-switching task into multiple tasks dealing with a single language and then learning the domain-specific knowledge of each language separately. Therefore, in this paper, we attempt to introduce language identification information into the middle layer of the ASR model’s encoder. |
Tzu-Ting Yang; Hsin-Wei Wang; Berlin Chen; | arxiv-cs.CL | 2023-12-15 |
379 | Hourglass-AVSR: Down-Up Sampling-Based Computational Efficiency Model for Audio-Visual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recently audio-visual speech recognition (AVSR), which better leverages video modality as additional information to extend automatic speech recognition (ASR), has shown promising … |
Fan Yu; Haoxu Wang; Ziyang Ma; Shiliang Zhang; | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
380 | FastInject: Injecting Unpaired Text Data Into CTC-Based ASR Training Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recently, connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models have achieved impressive results, especially with the … |
Keqi Deng; Phil Woodland; | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
381 | Towards Automatic Data Augmentation for Disordered Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data … |
ZENGRUI JIN et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-12-14 |
382 | Extending Whisper with Prompt Tuning to Target-speaker ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. |
Hao Ma; Zhiyuan Peng; Mingjie Shao; Jing Li; Ju Liu; | arxiv-cs.CL | 2023-12-13 |
383 | ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, a time-domain recognition-oriented speech enhancement (ROSE) framework is proposed to improve speech intelligibility and also advance ASR accuracy based on convolutional encoder-decoder-based U-Net framework, which serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model. |
Xincheng Yu; Dongyue Guo; Jianwei Zhang; Yi Lin; | arxiv-cs.SD | 2023-12-10 |
384 | Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. |
Wonjun Lee; Gary Geunbae Lee; Yunsu Kim; | arxiv-cs.CL | 2023-12-06 |
385 | Taiwanese Hakka Across Taiwan Corpus and Formosa Speech Recognition Challenge 2023 – Hakka ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: To revive the endangered Taiwanese Hakka language, the first large-scale Taiwanese Hakka speech corpus across Taiwan (HAT) was developed, representing modern Taiwanese Hakka … |
YUAN-FU LIAO et. al. | 2023 26th Conference of the Oriental COCOSDA International … | 2023-12-04 |
386 | End-to-End Speech-to-Text Translation: A Survey Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, researchers have been exploring end-to-end (E2E) models for ST translation. |
Nivedita Sethiya; Chandresh Kumar Maurya; | arxiv-cs.CL | 2023-12-02 |
387 | FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. |
Dongning Yang; Wei Wang; Yanmin Qian; | arxiv-cs.SD | 2023-11-29 |
388 | End-to-end Joint Punctuated and Normalized ASR with A Limited Amount of Punctuated Training Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose two approaches to train an end-to-end joint punctuated and normalized ASR system using limited punctuated data. |
Can Cui; Imran Ahamad Sheikh; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.CL | 2023-11-29 |
389 | Research Applications of Hidden Markov Models in Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Reinforcement Learning, a vital branch of Machine Learning, has gained significant attention due to its interactive and goal-oriented learning approach. Its primary objective is … |
Zeng Li; Zhenzhen Wang; Xiaofei Sun; | Proceedings of the 2023 International Conference on … | 2023-11-18 |
390 | On The Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. |
Xiaohan Shi; Jiajun He; Xingfeng Li; Tomoki Toda; | arxiv-cs.SD | 2023-11-13 |
391 | Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which is comprised of a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. |
Qijie Shao; Pengcheng Guo; Jinghao Yan; Pengfei Hu; Lei Xie; | arxiv-cs.SD | 2023-11-12 |
392 | Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose to model speech tokens in an autoregressive way, similar to text. |
QIAN CHEN et. al. | arxiv-cs.CL | 2023-11-08 |
393 | Improved Child Text-to-Speech Synthesis Through Fastpitch-based Transfer Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech. |
Rishabh Jain; Peter Corcoran; | arxiv-cs.SD | 2023-11-07 |
394 | Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. |
RABINDRA NATH NANDI et. al. | arxiv-cs.CL | 2023-11-06 |
395 | COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. |
JING PAN et. al. | arxiv-cs.CL | 2023-11-03 |
396 | Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models Via Language-Specific Experts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. |
Thomas Palmeira Ferraz; Marcely Zanon Boito; Caroline Brun; Vassilina Nikoulina; | arxiv-cs.CL | 2023-11-02 |
397 | Learning Adapters for Code-Switching Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multilingual code-switching speech recognition has been an emerging research direction in real-world applications since most speakers are bilingual or multilingual. A … |
Chun-Yi He; Jen-Tzung Chien; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
398 | Speech Emotion Recognition By Late Fusion of Linguistic and Acoustic Features Using Deep Learning Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this study, we investigated speech emotion recognition using both linguistic and acoustic features contained in emotional speech. Speech recognition is necessary to extract … |
Kiyohide Sato; Keita Kishi; Tetsuo Kosaka; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
399 | Incorporating Pinyin Into Pipeline Named Entity Recognition from Chinese Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Named Entity Recognition (NER) from speech is usually implemented through a two-step pipeline that consists of (1) processing the audio using an Automatic Speech Recognition (ASR) … |
MIN ZHANG et. al. | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
400 | ASR Model Adaptation for Rare Words Using Synthetic Data Generated By Multiple Text-To-Speech Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) for rare words is difficult as there are few relevant text-audio data pairs to train an ASR model. To obtain more text-audio pairs, text-only … |
Kwok Chin Yuen; Haoyang Li; Chng Eng Siong; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
401 | Multi-Self-Supervised Learning Model-Based Throat Microphone Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Throat microphones (TMs) can record sounds and mitigate the effects of external noise. Ongoing works seek to apply TMs to speech recognition in high-noise environments. However, … |
Kohta Masuda; Jun Ogata; Masafumi Nishida; Masafumi Nishimura; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
402 | Transformer-based Automatic Speech Recognition of Simultaneous Interpretation with Auxiliary Input of Source Language Text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) of simultaneous interpretation is challenging due to disfluencies such as hesitations, filled pauses, interruptions, and self-repairs. … |
Shuta Taniguchi; Tsuneo Kato; Akihiro Tamura; Keiji Yasuda; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
403 | Synthetic Data Augmentation for ASR with Domain Filtering Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recent studies have shown that synthetic speech can effectively serve as training data for automatic speech recognition models. Text data for synthetic speech is mostly obtained … |
Tuan Vu Ho; Shota Horiguchi; Shinji Watanabe; Paola Garcia; Takashi Sumiyoshi; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
404 | MUST: A Multilingual Student-Teacher Learning Approach for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, the aforementioned limitation is addressed by proposing a MUltilingual Student-Teacher (MUST) learning which exploits a posteriors mapping approach. |
Muhammad Umar Farooq; Rehan Ahmad; Thomas Hain; | arxiv-cs.CL | 2023-10-28 |
405 | MADGF: Multi-Agent Data Generation Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Automatic Speech Recognition (ASR) systems predominantly cater to monolingual inputs and struggle with the complexity introduced by mixed language audio. In this paper, we present a novel Multi-Agent Data Generation Framework (MADGF) to address this challenge. |
Peng Xie; Kani Chen; | arxiv-cs.SD | 2023-10-27 |
406 | A Review on Speech Recognition for Under-Resourced Languages: A Case Study of Vietnamese Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Fundamental speech recognition technologies for high-resourced languages are currently mature enough to build high-quality applications with the use of deep learning models. However, … |
Trung-Nghia Phung; Duc-Binh Nguyen; Ngoc-Phuong Pham; | Int. J. Knowl. Syst. Sci. | 2023-10-27 |
407 | Back Transcription As A Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. |
Marek Kubis; Paweł Skórzewski; Marcin Sowański; Tomasz Ziętkiewicz; | arxiv-cs.CL | 2023-10-25 |
408 | Evaluating A Fine-Tuned Whisper Model on Underrepresented Romanian Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech datasets available for training Romanian automatic speech recognition (ASR) systems are constructed around similar demographics (male voices, age between 19-29 years). In … |
V. Pais; V. Mititelu; Radu Ion; Elena Irimia; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
409 | A Comparative Analysis Between Conformer-Transducer, Whisper, and Wav2vec2 for Improving The Child Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Automatic Speech Recognition (ASR) systems have progressed significantly in their performance on adult speech data; however, transcribing child speech remains challenging due to … |
Andrei Barcovschi; Rishabh Jain; Peter Corcoran; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
410 | Dysarthric Speech Recognition Using Depthwise Separable Convolutions: Preliminary Study Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: As a neurological disability that affects muscles involved in articulation, dysarthria is a speech impairment that leads to reduced speech intelligibility. In severe cases, these … |
Seyed Reza Shahamiri; Krishnendu Mandal; Sudeshna Sarkar; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
411 | Uncovering Bias in ASR Systems: Evaluating Wav2vec2 and Whisper for Dutch Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: It is crucial that ASR systems can handle the wide range of variations in speech of speakers from different demographic groups, with different speaking styles, and of speakers … |
Márcio Fuckner; Sophie Horsman; Pascal Wiggers; Iskaj Janssen; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
412 | ArTST: Arabic Text and Speech Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language. |
Hawau Olamide Toyin; Amirbek Djanibekov; Ajinkya Kulkarni; Hanan Aldarmaki; | arxiv-cs.CL | 2023-10-25 |
413 | Hypotheses Paradise: An Open and Strong Baseline for Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. |
CHEN CHEN et. al. | nips | 2023-10-24 |
414 | CDSD: Chinese Dysarthria Speech Database Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. |
MENGYI SUN et. al. | arxiv-cs.SD | 2023-10-24 |
415 | How Much Context Does My Attention-Based ASR System Need? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we conduct an empirical study on the effect of scaling the sequence length used to train/evaluate (dense-attention-based) acoustic models on speech recognition performance. |
Robert Flynn; Anton Ragni; | arxiv-cs.CL | 2023-10-24 |
416 | Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. |
SARA PAPI et. al. | arxiv-cs.CL | 2023-10-23 |
417 | Intuitive Multilingual Audio-Visual Speech Recognition with A Single-Trained Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model on a multilingual dataset. |
Joanna Hong; Se Jin Park; Yong Man Ro; | arxiv-cs.MM | 2023-10-23 |
418 | Conversational Speech Recognition By Learning Audio-textual Cross-modal Contextual Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. |
KUN WEI et. al. | arxiv-cs.SD | 2023-10-22 |
419 | Intelligibility Prediction with A Pretrained Noise-robust Automatic Speech Recognition Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper describes two intelligibility prediction systems derived from a pretrained noise-robust automatic speech recognition (ASR) model for the second Clarity Prediction … |
Zehai Tu; Ning Ma; Jon Barker; | ArXiv | 2023-10-20 |
420 | BUT CHiME-7 System Description Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes the joint effort of Brno University of Technology (BUT), AGH University of Krakow and University of Buenos Aires on the development of Automatic Speech Recognition systems for the CHiME-7 Challenge. |
MARTIN KARAFIÁT et. al. | arxiv-cs.SD | 2023-10-18 |
421 | VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to the linguistic diversity and variations, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (ASR) of Arabic. |
Abdul Waheed; Bashar Talafha; Peter Sullivan; AbdelRahim Elmadany; Muhammad Abdul-Mageed; | arxiv-cs.CL | 2023-10-17 |
422 | Generative Error Correction for Code-switching Speech Recognition Using Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), … |
CHEN CHEN et. al. | ArXiv | 2023-10-17 |
423 | Correction Focused Language Model Training for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible words. |
Yingyi Ma; Zhe Liu; Ozlem Kalinli; | arxiv-cs.CL | 2023-10-17 |
424 | Multi-stage Large Language Model Correction for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. |
Jie Pu; Thai-Son Nguyen; Sebastian Stüker; | arxiv-cs.CL | 2023-10-17 |
425 | Noise-Robust Automatic Speech Recognition for Industrial and Urban Environments Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speech Recognition (ASR) models can achieve human parity, but their performance degrades significantly when used in noisy industrial and urban environments. In this … |
Daniil Orel; H. A. Varol; | IECON 2023- 49th Annual Conference of the IEEE Industrial … | 2023-10-16 |
426 | End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame crosschannel attention and a speaker-attributed Transformer-based decoder. |
Can Cui; Imran Ahamad Sheikh; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.CL | 2023-10-16 |
427 | Detecting Speech Abnormalities With A Perceiver-Based Sequence Classifier That Leverages A Universal Speech Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders. We combine this classifier with a Universal Speech … |
H. SOLTAU et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-10-16 |
428 | Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification. |
ZHIHONG LEI et. al. | arxiv-cs.LG | 2023-10-15 |
429 | Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing, leveraging the power of deep learning models to deliver accurate transcriptions across a wide variety of vocabularies and speaking styles. |
Ankitha Sudarshan; Vinay Samuel; Parth Patwa; Ibtihel Amara; Aman Chadha; | arxiv-cs.CL | 2023-10-14 |
430 | SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel Speech Augmented Language Model (SALM) with multitask and in-context learning capabilities. |
ZHEHUAI CHEN et. al. | arxiv-cs.CL | 2023-10-13 |
431 | On The Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. |
Nick Rossenbach; Benedikt Hilmes; Ralf Schlüter; | arxiv-cs.CL | 2023-10-12 |
432 | Adapting The Adapters for Code-switching in Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. |
Atharva Kulkarni; Ajinkya Kulkarni; Miguel Couceiro; Hanan Aldarmaki; | arxiv-cs.CL | 2023-10-11 |
433 | A Study of Speech Recognition, Speech Translation, and Speech Summarization of TED English Lectures Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Our research focuses on developing an automatic speech recognition system for English lectures, which involves summarizing the content and providing Japanese subtitles. Subtitling … |
Kazumasa Yamamoto; Haruhiko Banno; Haruki Sakurai; Toichiro Adachi; Seiichi Nakagawa; | 2023 IEEE 12th Global Conference on Consumer Electronics … | 2023-10-10 |
434 | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). |
SRIJITH RADHAKRISHNAN et. al. | arxiv-cs.CL | 2023-10-10 |
435 | No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While in the context of hybrid ASR models several solutions have been proposed, the gender bias issue has not been explicitly addressed in end-to-end neural architectures. To fill this gap, we propose a data augmentation technique that manipulates the fundamental frequency (f0) and formants. |
Dennis Fucci; Marco Gaido; Matteo Negri; Mauro Cettolo; Luisa Bentivogli; | arxiv-cs.CL | 2023-10-10 |
436 | Acoustic Model Fusion for End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing inspiration from the concept of LM fusion, we propose the integration of an external AM into the E2E system to better address the domain mismatch. |
ZHIHONG LEI et. al. | arxiv-cs.SD | 2023-10-10 |
437 | ToozKit: System for Experimenting with Captions on A Head-worn Display Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The advent of Automatic Speech Recognition (ASR) has made real-time captioning for the Deaf and Hard-of-Hearing (DHH) community possible, and integration of ASR into Head-worn … |
Peter Feng; David Martin; Thad Starner; | Adjunct Proceedings of the 2023 ACM International Joint … | 2023-10-08 |
438 | Ed-cec: Improving Rare Word Recognition Using Asr Postprocessing Based on Error Detection and Context-aware Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization. To address this challenge, we present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection and context-aware error correction. |
Jiajun He; Zekun Yang; Tomoki Toda; | arxiv-cs.AI | 2023-10-08 |
439 | Improving End-to-End Speech Processing By Efficient Text Data Utilization with Latent Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. |
JIANQIAO LU et. al. | arxiv-cs.CL | 2023-10-08 |
440 | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. |
ZHIHAO DU et. al. | arxiv-cs.SD | 2023-10-06 |
441 | Dementia Assessment Using Mandarin Speech with An Attention-based Speech Recognition Encoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper utilizes a speech recognition model to construct a dementia assessment system tailored for Mandarin speakers during the picture description task. |
ZIH-JYUN LIN et. al. | arxiv-cs.CL | 2023-10-05 |
442 | EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. |
Tejes Srivastava; Jiatong Shi; William Chen; Shinji Watanabe; | arxiv-cs.SD | 2023-10-05 |
443 | LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of End-to-end ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models. |
ALEKSANDR MEISTER et. al. | arxiv-cs.CL | 2023-10-04 |
444 | Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPUM. |
Liming Wang; Mark Hasegawa-Johnson; Chang D. Yoo; | arxiv-cs.CL | 2023-10-03 |
445 | Evaluating Speech Synthesis By Training Recognizers on Synthetic Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Prior works focus on evaluating synthetic speech based on pre-trained speech recognition models, however, this can be limiting since this approach primarily measures speech intelligibility. In this paper, we propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech. |
DAREEN ALHARTHI et. al. | arxiv-cs.CL | 2023-10-01 |
446 | AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Several publications have highlighted racial bias in speech-to-text algorithms, and performance on minority accents lags significantly. |
TOBI OLATUNJI et. al. | arxiv-cs.CL | 2023-09-30 |
447 | AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. |
Andrew Rouditchenko; Ronan Collobert; Tatiana Likhomanenko; | arxiv-cs.LG | 2023-09-29 |
448 | SLM: Bridge The Thin Gap Between Speech and Text Foundation Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. |
MINGQIU WANG et. al. | arxiv-cs.CL | 2023-09-29 |
449 | Federated Learning with Differential Privacy for End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we aim to bridge this research gap by formulating an ASR benchmark for FL with DP and establishing the first baselines. |
MARTIN PELIKAN et. al. | arxiv-cs.LG | 2023-09-29 |
450 | The Gift of Feedback: Improving ASR Model Quality By Learning from User Corrections Through Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. |
LILLIAN ZHOU et. al. | arxiv-cs.CL | 2023-09-29 |
451 | LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework. |
GUODONG MA et. al. | arxiv-cs.SD | 2023-09-28 |
452 | HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. |
CHEN CHEN et. al. | arxiv-cs.CL | 2023-09-27 |
453 | Lip2Vec: Efficient and Robust Visual Speech Recognition Via Latent-to-Latent Visual to Audio Representation Mapping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec, that is based on learning a prior model. |
Yasser Abdelaziz Dahou Djilali; Sanath Narayan; Haithem Boussaid; Ebtessam Almazrouei; Merouane Debbah; | iccv | 2023-09-27 |
454 | Speech Collage: Code-switched Audio Generation By Collaging Monolingual Corpora Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. |
AMIR HUSSEIN et. al. | arxiv-cs.SD | 2023-09-27 |
455 | Unsupervised Pre-Training for Vietnamese Automatic Speech Recognition in The HYKIST Project Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In today’s interconnected world, moving abroad is increasingly prevalent, whether for employment, refugee resettlement, or other reasons. Language difficulties between … |
Khai Le-Duc; | ArXiv | 2023-09-26 |
456 | Updated Corpora and Benchmarks for Long-Form Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we re-release three standard ASR corpora – TED-LIUM 3, GigaSpeech, and VoxPopuli-en – with updated transcriptions and alignments to enable their use for long-form ASR research. |
JENNIFER DREXLER FOX et. al. | arxiv-cs.CL | 2023-09-26 |
457 | Learning From Flawed Data: Weakly Supervised Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform “non-verbatim” transcription, … |
DONGJI GAO et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-09-26 |
458 | Speech Dereverberation With Frequency Domain Autoregressive Modeling Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation. The task of dereverberation constitutes an important step to … |
Anurenjan Purushothaman; Debottam Dutta; Rohit Kumar; Sriram Ganapathy; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2023-09-24 |
459 | A Survey of Automatic Speech Recognition Deep Models Performance for Polish Medical Terms Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Among the numerous applications of speech-to-text technology is the support of documentation created by medical personnel. There are many available speech recognition systems for … |
MARTA ZIELONKA et. al. | 2023 Signal Processing: Algorithms, Architectures, … | 2023-09-20 |
460 | AudioFool: Fast, Universal and Synchronization-free Cross-Domain Attack on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent research has focused on exploring methods to create such attacks, however, some issues relating to Over-The-Air (OTA) attacks have not been properly addressed. In our work, we examine the needed properties of robust attacks compatible with the OTA model, and we design a method of generating attacks with arbitrary such desired properties, namely the invariance to synchronization, and the robustness to filtering: this allows a Denial-of-Service (DoS) attack against ASR systems. |
Mohamad Fakih; Rouwaida Kanj; Fadi Kurdahi; Mohammed E. Fouda; | arxiv-cs.CR | 2023-09-20 |
461 | Directional Source Separation for Robust Speech Recognition on Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve voice quality, this work investigates directional source separation using the multi-microphone array. |
TIANTIAN FENG et. al. | arxiv-cs.SD | 2023-09-19 |
462 | Instruction-Following Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the mechanisms behind these models’ speech understanding and reasoning capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. |
Cheng-I Jeff Lai; Zhiyun Lu; Liangliang Cao; Ruoming Pang; | arxiv-cs.CL | 2023-09-18 |
463 | HypR: A Comprehensive Study for ASR Hypothesis Revising with A Reference Corpus Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Accordingly, we first concentrate on providing an ASR hypothesis revising (HypR) dataset in this study. |
Yi-Wei Wang; Ke-Han Lu; Kuan-Yu Chen; | arxiv-cs.CL | 2023-09-18 |
464 | BIGOS – Benchmark Intended Grouping of Open Speech Corpora for Polish Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents a Benchmark Intended Grouping of Open Speech (BIGOS), a new corpus designed for Polish Automatic Speech Recognition (ASR) systems. This initial version of the … |
Michał Junczyk; | 2023 18th Conference on Computer Science and Intelligence … | 2023-09-17 |
465 | Are Soft Prompts Good Zero-shot Learners for Speech Recognition? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). |
DIANWEN NG et. al. | arxiv-cs.SD | 2023-09-17 |
466 | Open Vocabulary Keyword Spotting with Small-Footprint ASR-based Architecture and Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present the results of experiments on minimizing the model size for the text-based Open Vocabulary Keyword Spotting task. The main goal is to perform inference on devices with … |
Mikołaj Pudo; Mateusz Wosik; Artur Janicki; | 2023 18th Conference on Computer Science and Intelligence … | 2023-09-17 |
467 | Augmenting Conformers with Structured State-space Sequence Models for Online Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. |
HAOZHE SHAN et. al. | arxiv-cs.CL | 2023-09-15 |
468 | Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. |
YANG LI et. al. | arxiv-cs.LG | 2023-09-14 |
469 | Echotune: A Modular Extractor Leveraging The Variable-Length Nature of Speech in ASR Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Historically, many approaches have leaned on fixed-length attention windows, which become problematic for speech samples that vary in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. Addressing this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. |
Sizhou Chen; Songyang Gao; Sen Fang; | arxiv-cs.SD | 2023-09-14 |
470 | CPPF: A Contextual and Post-processing-free Model for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we focus on ASR-related processing tasks, including contextual ASR and multiple ASR post-processing tasks. |
LEI ZHANG et. al. | arxiv-cs.CL | 2023-09-13 |
471 | SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches … |
HAOXU WANG et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-09-11 |
472 | SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. |
HAOXU WANG et. al. | arxiv-cs.SD | 2023-09-11 |
473 | Leveraging Large Language Models for Exploiting ASR Uncertainty IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. |
PRANAY DIGHE et. al. | arxiv-cs.CL | 2023-09-09 |
474 | Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. |
Huaibo Zhao; Yosuke Higuchi; Yusuke Kida; Tetsuji Ogawa; Tetsunori Kobayashi; | arxiv-cs.SD | 2023-09-08 |
475 | LanSER: Language-Model Supported Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. |
TAESIK GONG et. al. | arxiv-cs.CL | 2023-09-07 |
476 | Bring The Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel method to extract the denoising capabilities, that can be applied to any encoder-decoder architecture. |
Patrick Eickhoff; Matthias Möller; Theresa Pekarek Rosin; Johannes Twiefel; Stefan Wermter; | arxiv-cs.CL | 2023-09-05 |
477 | SememeASR: Boosting Performance of End-to-End Speech Recognition Against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Considering that knowledge-driven approaches can help data-driven approaches alleviate their flaws, we introduce sememe-based semantic knowledge information to speech recognition (SememeASR). |
Jiaxu Zhu; Changhe Song; Zhiyong Wu; Helen Meng; | arxiv-cs.SD | 2023-09-04 |
478 | Room Adaptation of Training Data for Distant Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present a novel signal processing-based approach for estimating room impulse responses for augmentation of ASR training data that is best suited to the reverberation … |
James Fosburgh; D. Sharma; P. Naylor; | 2023 31st European Signal Processing Conference (EUSIPCO) | 2023-09-04 |
479 | Text-Only Domain Adaptation for End-to-End Speech Recognition Through Down-Sampling Acoustic Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel representation-matching strategy that down-samples the acoustic representation to align it with the text modality. |
JIAXU ZHU et. al. | arxiv-cs.SD | 2023-09-04 |
480 | Boosting Low-Resource Speech Recognition in Air Traffic Communication Via Pretrained Feature Aggregation and Multi-Task Learning IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Developing a robust Automatic Speech Recognition (ASR) system usually requires a large amount of well-annotated samples which is extremely hard to build in the Air Traffic Control … |
Dongyue Guo; Zichen Zhang; Bo Yang; Jianwei Zhang; Yi Lin; | IEEE Transactions on Circuits and Systems II: Express Briefs | 2023-09-01 |
481 | Utilizing Automatic Speech Recognition for English Pronunciation Practice and Analyzing Its Impact Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The advancement of AI in recent years has been remarkable, along with the widespread use of speech recognition functions. In addition, an increasing number of people are … |
K. Umezawa; M. Nakazawa; Michiko Nakano; S. Hirasawa; | 2023 IEEE 12th International Conference on Engineering … | 2023-08-29 |
482 | ASTER: Automatic Speech Recognition System Accessibility Testing for Stutterers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the challenge, we propose ASTER, a technique for automatically testing the accessibility of ASR systems. |
YI LIU et. al. | arxiv-cs.SD | 2023-08-29 |
483 | NAaLoss: Rethinking The Objective of Speech Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Hence, in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss, to ameliorate the influence of artifacts from a novel perspective. |
Kuan-Hsun Ho; En-Lun Yu; Jeih-weih Hung; Berlin Chen; | arxiv-cs.SD | 2023-08-24 |
484 | Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a cross-modal global interaction and local alignment (GILA) approach for AVSR, which captures the deep audio-visual (A-V) correlations from both global and local perspectives. |
YUCHEN HU et. al. | ijcai | 2023-08-23 |
485 | Convoifilter: A Case Study of Doing Cocktail Party Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. |
Thai-Binh Nguyen; Alexander Waibel; | arxiv-cs.SD | 2023-08-22 |
486 | SeamlessM4T: Massively Multilingual & Multimodal Machine Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. |
SEAMLESS COMMUNICATION et. al. | arxiv-cs.CL | 2023-08-22 |
487 | An Enhanced Method for Dialect Transcription Via Error-correcting Thesaurus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) has been widely used in the field of customer service, but the performance of general ASR in dialect transcription is not satisfactory, … |
Xiaoliang Ma; Congjian Deng; Dequan Du; Qingqi Pei; | IET Commun. | 2023-08-21 |
488 | Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children’s Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children’s speech. One cause is the high … |
Christopher Gebauer; Lars Rumberg; Hanna Ehlert; Ulrike Lüdtke; Jörn Ostermann; | Interspeech | 2023-08-20 |
489 | Using Commercial ASR Solutions to Assess Reading Skills in Children: A Case Report Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Reading is an acquired skill that is essential for integrating and participating in today’s society. Yet, becoming literate can be particularly laborious for some children. … |
TIMOTHY PITON et. al. | Interspeech | 2023-08-20 |
490 | Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing Based Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View |
ZHENG LIANG et. al. | Interspeech | 2023-08-20 |
491 | LABERT: A Combination of Local Aggregation and Self-Supervised Speech Representation Learning for Detecting Informative Hidden Units in Low-Resource ASR Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: With advances in deep learning methodologies, Automatic Speech Recognition (ASR) systems have seen impressive results. However, ASR in Low-Resource Environments (LREs) is … |
Kavan Fatehi; Ayse Kucukyilmaz; | Interspeech | 2023-08-20 |
492 | Speaker Diarization for ASR Output with T-vectors: A Sequence Classification Approach Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper considers applying speaker diarization (SD) to the output tokens of automatic speech recognition (ASR). We formulate the task to be solved as a sequence classification … |
MIDIA YOUSEFI et. al. | Interspeech | 2023-08-20 |
493 | Improving The Response Timing Estimation for Spoken Dialogue Systems By Reducing The Effect of Speech Recognition Delay Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In conversational systems, the proper timing of the system’s response is critical to maintaining a comfortable conversation. To achieve appropriate timing estimation, it is … |
Jin Sakuma; S. Fujie; Huaibo Zhao; Tetsunori Kobayashi; | Interspeech | 2023-08-20 |
494 | A Conformer-based Classifier for Variable-length Utterance Processing in Anti-spoofing IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The success achieved by conformers in Automatic Speech Recognition (ASR) leads us to their application in other domains, such as spoofing detection for automatic speaker … |
Eros Rosello; Alejandro Gomez-Alanis; A. Gómez; A. Peinado; | Interspeech | 2023-08-20 |
495 | Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper addresses effective pretraining of automatic speech recognition (ASR) and gender recognition to improve wav2vec 2.0 embedding for speech emotion recognition (SER). … |
Yuan Gao; Chenhui Chu; Tatsuya Kawahara; | Interspeech | 2023-08-20 |
496 | ASR for Low Resource and Multilingual Noisy Code-Mixed Speech Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Developing a reliable Automatic Speech Recognition (ASR) system for Indian Languages has been challenging due to the limited availability of large-scale, high-quality speech … |
Tushar Verma; Atul Shree; Ashutosh Modi; | Interspeech | 2023-08-20 |
497 | On Training A Neural Residual Acoustic Echo Suppressor for Improved ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Acoustic Echo Cancellation (AEC) is critical for accurate recognition of speech directed at a smart device playing audio. Previous work has shown that neural AEC models can … |
S. Panchapagesan; T. Shabestary; A. Narayanan; | Interspeech | 2023-08-20 |
498 | Noise-Robust Bandwidth Expansion for 8K Speech Recordings Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech recordings in call centers are narrowband and mixed with various noises. Developing a bandwidth expansion (BWE) model is important to mitigate the automated speech … |
YIN-TSE LIN et. al. | Interspeech | 2023-08-20 |
499 | A Neural Time Alignment Module for End-to-End Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end trainable (E2E) automatic speech recognition (ASR) systems have low word error rates, but they do not model timings or silence by default unlike hidden Markov model … |
Dongcheng Jiang; C. Zhang; P. Woodland; | Interspeech | 2023-08-20 |
500 | Improving Joint Speech and Emotion Recognition Using Global Style Tokens Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic speech recognition (ASR) and speech emotion recognition (SER) are closely related in that the acoustic features of speech, such as pitch, tone, and intensity, can vary … |
Jehyun Kyung; Ju-Seok Seong; Jeonghwan Choi; Ye-Rin Jeoung; Joon‐Hyuk Chang; | Interspeech | 2023-08-20 |
501 | I Learned Error, I Can Fix It! : A Detector-Corrector Structure for ASR Error Calibration Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech recognition technology has improved recently. However, in the context of spoken language understanding (SLU), containing automatic speech recognition (ASR) errors causes … |
Heuiyeen Yeen; Minju Kim; M. Koo; | Interspeech | 2023-08-20 |
502 | Joint Blind Source Separation and Dereverberation for Automatic Speech Recognition Using Delayed-Subsource MNMF with Localization Prior Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Overlapping speech and high room reverberation deteriorate the accuracy of automatic speech recognition (ASR). This paper proposes a method for jointly optimum source separation … |
Mieszko Fraś; Marcin Witkowski; K. Kowalczyk; | Interspeech | 2023-08-20 |
503 | Adapter-tuning with Effective Token-dependent Representation Shift for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The use of self-supervised pre-trained speech models has greatly improved speech tasks in low-resource settings. However, fine-tuning the entire model can be computationally … |
DIANWEN NG et. al. | Interspeech | 2023-08-20 |
504 | Speech Emotion Recognition Using Decomposed Speech Via Multi-task Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In speech emotion recognition, most recent studies used powerful models to obtain robust features without considering the disentangled components, which contain diverse … |
Jia-Hao Hsu; C. Wu; Yunchao Wei; | Interspeech | 2023-08-20 |
505 | Joint Autoregressive Modeling of End-to-End Multi-Talker Overlapped Speech Recognition and Utterance-level Timestamp Prediction Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper proposes autoregressive modeling of the joint multi-talker automatic speech recognition (ASR) and timestamp prediction. Autoregressive modeling of multi-talker ASR is a … |
Naoki Makishima; Keita Suzuki; Satoshi Suzuki; Atsushi Ando; Ryo Masumura; | Interspeech | 2023-08-20 |
506 | Unsupervised Code-switched Text Generation from Parallel Text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual … |
JI-EUN CHI et. al. | Interspeech | 2023-08-20 |
507 | Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Knowledge about phonemes and their articulatory attributes can help improve automatic speech recognition (ASR) of low-resource languages. In this study, we propose a simple and … |
Jaeyoung Lee; M. Mimura; Tatsuya Kawahara; | Interspeech | 2023-08-20 |
508 | Automatic Speaker Recognition with Variation Across Vocal Conditions: A Controlled Experiment with Implications for Forensics Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speaker Recognition (ASR) involves a complex range of processes to extract, model, and compare speaker-specific information from a pair of voice samples. Using heavily … |
VINCENT HUGHES et. al. | Interspeech | 2023-08-20 |
509 | Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We release 840 hours of read speech multi-dialect ASR corpora consisting of 700 hours of main Thai dialect, named Thai-central, and 40 hours for each local dialect, named … |
Artit Suwanbandit; Burin Naowarat; Orathai Sangpetch; E. Chuangsuwanich; | Interspeech | 2023-08-20 |
510 | Effective Training of Attention-based Contextual Biasing Adapters with Synthetic Audio for Personalised ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Contextual biasing (CB) is an effective approach for contextualising hidden features of neural transducer ASR models to improve rare word recognition. CB relies on relatively … |
Burin Naowarat; Philip Harding; Pasquale D’Alterio; Sibo Tong; Bashar Awwad Shiekh Hasan; | Interspeech | 2023-08-20 |
511 | Speech-in-Speech Recognition Is Modulated By Familiarity to Dialect Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Listening to speech in competing background speech can be difficult due to elements such as the linguistic content of the signal. Linguistic release from masking occurs when … |
Jessica L. L. Chin; Elena Talevska; M. Antoniou; | Interspeech | 2023-08-20 |
512 | An Improved End-to-End Audio-Visual Speech Recognition Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: By incorporating lip language, audio-visual speech recognition can effectively improve the recognition effect in noisy environments, and will slightly improve the recognition … |
Sheng Yang; Zheng Gong; Jiacang Kang; | Interspeech | 2023-08-20 |
513 | Wav2vec 2.0 ASR for Cantonese-Speaking Older Adults in A Clinical Setting Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The lack of large-scale speech corpora for Cantonese and older adults has impeded the academia’s research of automatic speech recognition (ASR) systems for the two. On the other … |
Ranzo Huang; B. Mak; | Interspeech | 2023-08-20 |
514 | Human Transcription Quality Improvement Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: High quality transcription data is crucial for training automatic speech recognition (ASR) systems. However, the existing industry-level data collection pipelines are expensive to … |
Jian Gao; Hanbo Sun; Cheng Cao; Zheng Du; | Interspeech | 2023-08-20 |
515 | Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speakers with dysarthria could particularly benefit from assistive speech technology, but are underserved by current automatic speech recognition (ASR) systems. The differences of … |
Enno Hermann; Mathew Magimai; | Interspeech | 2023-08-20 |
516 | Information Magnitude Based Dynamic Sub-sampling for Speech-to-text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Attention-based models have achieved new state-of-the-art in many tasks while the computational cost of these models increases drastically compared with previous methods. For most … |
YUHAO ZHANG et. al. | Interspeech | 2023-08-20 |
517 | MiniStreamer: Enhancing Small Conformer with Chunked-Context Masking for Streaming ASR Applications on The Edge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Real-time applications of Automatic Speech Recognition (ASR) on user devices on the edge require streaming processing. Conformer model has achieved state-of-the-art performance in … |
Haris Gulzar; Monikka Roslianna Busto; Takeharu Eda; Katsutoshi Itoyama; K. Nakadai; | Interspeech | 2023-08-20 |
518 | Whisper Features for Dysarthric Severity-Level Classification Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Dysarthria is a speech disorder caused by improper coordination between the brain and the muscles that produce intelligible speech. Accurately diagnosing the severity of … |
Siddharth Rathod; Monil Charola; Akshat Vora; Yash Jogi; H. Patil; | Interspeech | 2023-08-20 |
519 | Silent Speech Recognition with Articulator Positions Estimated from Tongue Ultrasound and Lip Video Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present a multi-speaker silent speech recognition system trained on articulator features derived from the Tongue and Lips corpus, a multi-speaker corpus of ultrasound tongue … |
Rachel Beeson; Korin Richmond; | Interspeech | 2023-08-20 |
520 | Regarding Topology and Variant Frame Rates for Differentiable WFST-based End-to-End ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end (E2E) Automatic Speech Recognition (ASR) has gained popularity in recent years, with most research focusing on designing novel neural network architectures, speech … |
Zeyu Zhao; P. Bell; | Interspeech | 2023-08-20 |
521 | Dialect Speech Recognition Modeling Using Corpus of Japanese Dialects and Self-Supervised Learning-based Model XLSR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In order to utilize the large amount of historical speech resources for applications such as linguistic analysis and retrieval, automatic speech recognition technology that can … |
Shogo Miwa; A. Kai; | Interspeech | 2023-08-20 |
522 | TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present TokenSplit, a speech separation model that acts on discrete token sequences. |
HAKAN ERDOGAN et. al. | arxiv-cs.SD | 2023-08-20 |
523 | Domain Adaptive Self-supervised Training of Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper explores domain adaptive self-supervised training of automatic speech recognition (ASR). Unlabeled data from the target domain can either be used in training the … |
Cong-Thanh Do; R. Doddipatla; Mohan Li; Thomas Hain; | Interspeech | 2023-08-20 |
524 | Data Augmentation for Children ASR and Child-adult Speaker Classification Using Voice Conversion Methods Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Many young children prefer speech based interfaces over text, as they are relatively slow and error-prone with text input. However, children ASR can be challenging due to the lack … |
Shuyang Zhao; Mittul Singh; Abraham Woubie; Reima Karhila; | Interspeech | 2023-08-20 |
525 | Exploring Sources of Racial Bias in Automatic Speech Recognition Through The Lens of Rhythmic Variation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Although studies have shown that one issue of bias in modern automatic speech recognition (ASR) technologies is degraded performance for African American English (AAE) speakers, … |
Li-Fang Lai; N. Holliday; | Interspeech | 2023-08-20 |
526 | Bayes Risk Transducer: Transducer with Controllable Alignment Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, this work proposes Bayes Risk Transducer (BRT), which uses a Bayes risk function to set lower risk values to the preferred paths so that the predicted alignment is more likely to satisfy specific desired properties. |
JINCHUAN TIAN et. al. | arxiv-cs.CL | 2023-08-19 |
527 | Assessment of L2 Oral Proficiency Using Self-Supervised Speech Representation Learning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: A standard pipeline for automated spoken language assessment is to start with an automatic speech recognition (ASR) system and derive features that exploit transcriptions and … |
Stefano Bannò; K. Knill; M. Matassoni; Vyas Raina; M. Gales; | Slate | 2023-08-18 |
528 | Accurate Synthesis of Dysarthric Speech for ASR Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. |
Mohammad Soleymanpour; Michael T. Johnson; Rahim Soleymanpour; Jeffrey Berry; | arxiv-cs.SD | 2023-08-16 |
529 | An Ambient Intelligence-based Approach For Longitudinal Monitoring of Verbal and Vocal Depression Symptoms Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Another major challenge in depression relapse research is the scarcity of publicly available datasets. To overcome these issues, we propose a one-shot learning framework for detecting depression relapse from speech. |
Alice Othmani; Muhammad Muzammel; | arxiv-cs.HC | 2023-08-16 |
530 | Radio2Text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Millimeter wave (mmWave) based speech recognition provides more possibility for audio-related applications, such as conference speech transcription and eavesdropping. However, … |
Running Zhao; Luca Jiang-Tao Yu; H. Zhao; Edith C. H. Ngai; | Proceedings of the ACM on Interactive, Mobile, Wearable and … | 2023-08-16 |
531 | Radio2Text: Streaming Speech Recognition Using MmWave Radio Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. |
Running Zhao; Jiangtao Yu; Hang Zhao; Edith C. H. Ngai; | arxiv-cs.SD | 2023-08-15 |
532 | Using Text Injection to Improve Recognition of Personal Identifiers in Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use text-injection to improve the recognition of PII categories by including fake textual substitutes of PII categories in the training data using a text injection method. |
YOCHAI BLAU et. al. | arxiv-cs.CL | 2023-08-14 |
533 | Text Injection for Capitalization and Turn-Taking Prediction in Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. |
SHAAN BIJWADIA et. al. | arxiv-cs.CL | 2023-08-14 |
534 | A Novel Self-training Approach for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a self-training approach for automatic speech recognition (ASR) for low-resource settings. |
Satwinder Singh; Feng Hou; Ruili Wang; | arxiv-cs.CL | 2023-08-09 |
535 | Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). |
Yang Zhang; Krishna C. Puvvada; Vitaly Lavrukhin; Boris Ginsburg; | arxiv-cs.SD | 2023-08-09 |
536 | The Role of Audio Features in Accent Recognition: A Comparative Analysis Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This study focuses on enhancing Automatic Speech Recognition (ASR) systems, crucial in Science and Technology, by addressing challenges tied to diverse speaker accents. The … |
Anik Biswas; | 2023 International Workshop on Intelligent Systems (IWIS) | 2023-08-09 |
537 | Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach that incorporates a dynamic error scaling mechanism to detect and correct phonetically erroneous text generated by ASR output. |
JIAXIN FAN et. al. | arxiv-cs.CL | 2023-08-07 |
538 | Federated Representation Learning for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. |
GURUPRASAD V RAMESH et. al. | arxiv-cs.SD | 2023-08-03 |
539 | Inaudible Adversarial Perturbation: Manipulating The Recognition of User Speech in Real Time Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we seek to bridge the gap in existing research and extend the attack to user-present scenarios. |
XINFENG LI et. al. | arxiv-cs.CR | 2023-08-02 |
540 | ÌròyìnSpeech: A Multi-purpose Yorùbá Speech Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce ÌròyìnSpeech, a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data, which can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. |
Tolulope Ogunremi; Kola Tubosun; Anuoluwapo Aremu; Iroro Orife; David Ifeoluwa Adelani; | arxiv-cs.CL | 2023-07-29 |
541 | The Timing Bottleneck: Why Timing and Overlap Are Mission-critical for Conversational User Interfaces, Speech Recognition and Dialogue Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). |
Andreas Liesenfeld; Alianda Lopez; Mark Dingemanse; | arxiv-cs.CL | 2023-07-28 |
542 | Cascaded Cross-Modal Transformer for Request and Complaint Detection Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We propose a novel cascaded cross-modal transformer (CCMT) that combines speech and text transcripts to detect customer requests and complaints in phone conversations. Our … |
Nicolae-Cătălin Ristea; Radu Tudor Ionescu; | Proceedings of the 31st ACM International Conference on … | 2023-07-27 |
543 | Modeling Spoken Information Queries for Virtual Assistants: Open Problems, Challenges and Opportunities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We discuss open problems and challenges with respect to modeling spoken information queries for virtual assistants, and list opportunities where Information Retrieval methods and research can be applied to improve the quality of virtual assistant speech recognition. |
Christophe Van Gysel; | sigir | 2023-07-25 |
544 | Boosting Punctuation Restoration with Data Generation and Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While punctuated texts are abundant in written documents, the discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap. |
VIET DAC LAI et. al. | arxiv-cs.CL | 2023-07-24 |
545 | Adaptation of Whisper Models to Child Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Automatic Speech Recognition (ASR) systems often struggle with transcribing child speech due to the lack of large child speech datasets required to accurately train child-friendly … |
Rishabh Jain; Andrei Barcovschi; Mariam Yiwere; Peter Corcoran; H. Cucu; | ArXiv | 2023-07-24 |
546 | Code-Switched Urdu ASR for Noisy Telephonic Environment Using Data Centric Approach with Hybrid HMM and CNN-TDNN Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, this paper describes an implementation framework for a resource-efficient Automatic Speech Recognition / Speech-to-Text system in a noisy call-center environment using Chain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. |
Muhammad Danyal Khan; Raheem Ali; Arshad Aziz; | arxiv-cs.CL | 2023-07-24 |
547 | Robust Automatic Speech Recognition Via WavAugment Guided Phoneme Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Developing a practically-robust automatic speech recognition (ASR) is challenging since the model should not only maintain the original performance on clean samples, but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). |
GEGE QI et. al. | arxiv-cs.SD | 2023-07-23 |
548 | Exploring The Integration of Speech Separation and Recognition with Self-Supervised Learning Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. |
YOSHIKI MASUYAMA et. al. | arxiv-cs.SD | 2023-07-23 |
549 | A Meta Learning Scheme for Fast Accent Domain Expansion in Mandarin Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce meta-learning techniques for fast accent domain expansion in Mandarin speech recognition, which expands the field of accents without deteriorating the performance of Mandarin ASR. |
Ziwei Zhu; Changhao Shan; Bihong Zhang; Jian Yu; | arxiv-cs.SD | 2023-07-23 |
550 | Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. |
SUYOUN KIM et. al. | arxiv-cs.CL | 2023-07-22 |
551 | A Change of Heart: Improving Speech Emotion Recognition Through Speech-to-Text Modality Conversion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a modality conversion concept aimed at enhancing emotion recognition performance on the MELD dataset. |
Zeinab Sadat Taghavi; Ali Satvaty; Hossein Sameti; | arxiv-cs.SD | 2023-07-21 |
552 | A Deep Dive Into The Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we describe the curation of a massive speech dataset of 8740 hours consisting of ~9.8K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography. |
Anand Kumar Rai; Siddharth D Jaiswal; Animesh Mukherjee; | arxiv-cs.CL | 2023-07-20 |
553 | Room Acoustic Characterization with Smartphone-Based Automated Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Characterizing and monitoring the acoustic quality of a room is important for maintaining effective speech communication. Noise and echoes make speech harder to perceive, … |
Brady Laska; Bruce Wallace; Abagael Hudak; Rafik Goubran; | 2023 IEEE Sensors Applications Symposium (SAS) | 2023-07-18 |
554 | Ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: We introduce ivrit.ai, a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) … |
Yanir Marmor; Kinneret Misgav; Y. Lifshitz; | ArXiv | 2023-07-17 |
555 | Replay to Remember: Continual Layer-Specific Fine-tuning for German Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain, we apply Experience Replay for continual learning. |
Theresa Pekarek Rosin; Stefan Wermter; | arxiv-cs.CL | 2023-07-14 |
556 | SGGNet$^2$: Speech-Scene Graph Grounding Network for Speech-guided Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel speech-scene graph grounding network (SGGNet$^2$) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. |
DOHYUN KIM et. al. | arxiv-cs.RO | 2023-07-14 |
557 | SGGNet2: Speech-Scene Graph Grounding Network for Speech-guided Navigation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately … |
DOHYUN KIM et. al. | 2023 32nd IEEE International Conference on Robot and Human … | 2023-07-14 |
558 | Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling Related Papers Related Patents Related Grants Related Venues Related Experts |