Paper Digest: Recent Papers on Speech Recognition
The Paper Digest Team extracted all recent Speech Recognition-related papers on our radar and generated highlight sentences for them. The results are sorted by relevance and date. In addition to this 'static' page, we also provide a real-time version of this article, which has broader coverage and is continuously updated with the most recent work on this topic.
Based in New York, Paper Digest is dedicated to producing high-quality text analysis results that people can actually use on a daily basis. Since 2018, we have been serving users across the world with a number of exclusive services to track, search, review and write scientific literature.
You are welcome to follow us on Twitter and LinkedIn to receive new conference digests.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Paper Digest: Recent Papers on Speech Recognition
Paper | Author(s) | Source | Date
---|---|---|---
1 | WER We Stand: Benchmarking Urdu ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. |
Samee Arif; Aamina Jamal Khan; Mustafa Abbas; Agha Ali Raza; Awais Athar; | arxiv-cs.CL | 2024-09-17 |
2 | Speech Recognition for Analysis of Police Radio Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription is challenging in this domain. |
Tejes Srivastava; Ju-Chieh Chou; Priyank Shroff; Karen Livescu; Christopher Graziul; | arxiv-cs.SD | 2024-09-16 |
3 | Augmenting Automatic Speech Recognition Models with Disfluency Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present an inference-only approach to augment any ASR model with the ability to detect open-set disfluencies. |
Robin Amann; Zhaolin Li; Barbara Bruno; Jan Niehues; | arxiv-cs.CL | 2024-09-16 |
4 | Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. |
CHAO-HAN HUCK YANG et. al. | arxiv-cs.CL | 2024-09-15 |
5 | Exploring SSL Discrete Tokens for Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a comprehensive comparison of discrete tokens generated by various leading SSL models across multiple language domains. |
MINGYU CUI et. al. | arxiv-cs.CL | 2024-09-13 |
6 | Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a pioneering effort to investigate the capability of LLMs in transcribing speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target talker ASR, and ASR based on specific talker attributes such as sex, occurrence order, language, and keyword spoken. |
LINGWEI MENG et. al. | arxiv-cs.CL | 2024-09-13 |
7 | LA-RAG: Enhancing LLM-based ASR Accuracy with Retrieval-Augmented Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, existing methods are often constrained by the capabilities of the speech encoders under varied acoustic conditions, such as accents. To address this, we propose LA-RAG, a novel Retrieval-Augmented Generation (RAG) paradigm for LLM-based ASR. |
SHAOJUN LI et. al. | arxiv-cs.SD | 2024-09-13 |
8 | M$^{3}$V: A Multi-modal Multi-view Approach for Device-Directed Speech Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task, introducing unimodal views and a text-audio alignment view in the network alongside the multi-modal view. |
ANNA WANG et. al. | arxiv-cs.SD | 2024-09-13 |
9 | Full-text Error Correction for Chinese Speech Recognition with Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-09-12 |
10 | WhisperNER: Unified Open Named Entity and Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. |
GIL AYACHE et. al. | arxiv-cs.CL | 2024-09-12 |
11 | The Faetar Benchmark: Speech Recognition in A Very Under-Resourced Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. |
MICHAEL ONG et. al. | arxiv-cs.CL | 2024-09-12 |
12 | Enhancing CTC-Based Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). |
Hendrik Laux; Anke Schmeink; | arxiv-cs.CV | 2024-09-11 |
13 | Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Hence, this work extends SummaryMixing to a Conformer Transducer that works in both a streaming and an offline mode. |
Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Batthacharya; | arxiv-cs.SD | 2024-09-11 |
14 | Keyword-Aware ASR Error Augmentation for Robust Dialogue State Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a simple yet effective data augmentation method that targets those entities to improve the robustness of the DST model. |
Jihyun Lee; Solee Im; Wonjun Lee; Gary Geunbae Lee; | arxiv-cs.CL | 2024-09-10 |
15 | An Effective Context-Balanced Adaptation Approach for Long-Tailed Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In light of this, we explore in-depth the impact of altering the context list to have words with different frequency distributions on model performance, and meanwhile extend CA with a simple yet effective context-balanced learning objective. A series of experiments conducted on the AISHELL-1 benchmark dataset suggests that using all vocabulary words from the training corpus as the context list and pairing them with our balanced objective yields the best performance, demonstrating a significant reduction in character error rate (CER) by up to 1.21% and a more pronounced 9.44% reduction in the error rate of zero-shot words. |
YI-CHENG WANG et. al. | arxiv-cs.CL | 2024-09-10 |
16 | What Is Lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our research reveals that current text normalization practices, which aim to standardize ASR outputs for fair comparison by removing inconsistencies such as variations in spelling, punctuation, and special characters, are fundamentally flawed when applied to Indic scripts. Through empirical analysis using text similarity scores and in-depth linguistic examination, we demonstrate that these flaws lead to artificially inflated performance metrics for Indic languages. |
Kavya Manohar; Leena G Pillai; | arxiv-cs.CL | 2024-09-04 |
17 | Quantification of Stylistic Differences in Human- and ASR-produced Transcripts of African American English Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We categorize the kinds of stylistic differences between 6 transcription versions, 4 human- and 2 ASR-produced, of 10 hours of African American English (AAE) speech. Focusing on verbatim features and AAE morphosyntactic features, we investigate the interactions of these categories with how well transcripts can be compared via word error rate (WER). |
ANNIKA HEUSER et. al. | arxiv-cs.CL | 2024-09-04 |
18 | Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. |
Hao Shi; Yuan Gao; Zhaoheng Ni; Tatsuya Kawahara; | arxiv-cs.SD | 2024-09-01 |
19 | LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. |
ZENGRUI JIN et. al. | arxiv-cs.SD | 2024-09-01 |
20 | Comparing Discrete and Continuous Space LLMs for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper investigates discrete and continuous speech representations in Large Language Model (LLM)-based Automatic Speech Recognition (ASR), organizing them by feature continuity and training approach into four categories: supervised and unsupervised for both discrete and continuous types. |
Yaoxun Xu; Shi-Xiong Zhang; Jianwei Yu; Zhiyong Wu; Dong Yu; | arxiv-cs.CL | 2024-09-01 |
21 | ProGRes: Prompted Generative Rescoring on ASR N-Best Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel method that uses instruction-tuned LLMs to dynamically expand the n-best speech recognition hypotheses with new hypotheses generated through appropriately-prompted LLMs. |
Ada Defne Tur; Adel Moumen; Mirco Ravanelli; | arxiv-cs.CL | 2024-08-30 |
22 | Measuring The Accuracy of Automatic Speech Recognition Solutions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: At the same time, the DHH (deaf and hard of hearing) community reports serious issues with the accuracy and reliability of ASR. |
Korbinian Kuhn; Verena Kersken; Benedikt Reuter; Niklas Egger; Gottfried Zimmermann; | arxiv-cs.CL | 2024-08-29 |
23 | Speech Recognition Transformers: Topological-lingualism Perspective Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The paper presents a comprehensive survey of transformer techniques oriented in speech modality. |
Shruti Singh; Muskaan Singh; Virender Kadyan; | arxiv-cs.CL | 2024-08-27 |
24 | MEDSAGE: Enhancing Robustness of Medical Dialogue Summarization to ASR Errors with LLM-generated Synthetic Dialogues Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Employing conventional data augmentation for enhancing the noise robustness of summarization models is not feasible either due to the unavailability of sufficient medical dialogue audio recordings and corresponding ASR transcripts. To address this challenge, we propose MEDSAGE, an approach for generating synthetic samples for data augmentation using Large Language Models (LLMs). |
KULUHAN BINICI et. al. | arxiv-cs.CL | 2024-08-26 |
25 | Self-supervised Speech Representations Still Struggle with African American Vernacular English Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We investigate whether or not the recent wave of Self-Supervised Learning (SSL) speech models can close the gap in ASR performance between AAVE and Mainstream American English (MAE). We evaluate four SSL models (wav2vec 2.0, HuBERT, WavLM, and XLS-R) on zero-shot Automatic Speech Recognition (ASR) for these two varieties and find that these models perpetuate the bias in performance against AAVE. |
KALVIN CHANG et. al. | arxiv-cs.CL | 2024-08-26 |
26 | Developing Vocal System Impaired Patient-aimed Voice Quality Assessment Approach Using ASR Representation-included Multiple Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This article addresses these challenges by showcasing the utilization of automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech. This innovative approach aims to estimate voice quality of patients with impaired vocal systems. |
SHAOXIANG DANG et. al. | arxiv-cs.SD | 2024-08-22 |
27 | Towards Measuring Fairness in Speech Recognition: Fair-Speech Dataset Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel dataset, Fair-Speech, a publicly released corpus to help researchers evaluate their ASR models for accuracy across a diverse set of self-reported demographic information, such as age, gender, ethnicity, geographic variation and whether the participants consider themselves native English speakers. |
IRINA-ELENA VELICHE et. al. | arxiv-cs.AI | 2024-08-22 |
28 | The State of Commercial Automatic French Legal Speech Recognition Systems and Their Impact on Court Reporters Et Al Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We benchmark three ASR models, including commercial and open-source options, on their ability to recognize French legal speech using a curated dataset. Our study evaluates the performance of these systems using the Word Error Rate (WER) metric and introduces the Sonnex Distance to account for phonetic accuracy. |
Nicolad Garneau; Olivier Bolduc; | arxiv-cs.CL | 2024-08-21 |
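Several entries above (e.g. the study on French legal speech recognition) evaluate systems with the Word Error Rate (WER): the word-level Levenshtein edit distance between a reference and a hypothesis transcript, normalized by the reference length. A minimal, illustrative sketch follows; production toolkits such as jiwer or NIST's sclite additionally apply text normalization before scoring, which this sketch omits.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,               # substitution or match
                           dp[i - 1][j] + 1,  # deletion
                           dp[i][j - 1] + 1)  # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason papers such as entry 28 supplement it with phonetically motivated measures.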
29 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | acl | 2024-08-20 |
30 | StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose StreamSpeech, a direct Simul-S2ST model that jointly learns translation and simultaneous policy in a unified framework of multi-task learning. |
SHAOLEI ZHANG et. al. | acl | 2024-08-20 |
31 | Improving Speech Recognition Error Prediction for Modern and Off-the-shelf Speech Recognizers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluate the error predictors in two ways: first by predicting the errors made by a Switchboard ASR system on unseen data (Fisher), and then using that same predictor to estimate the behavior of an unrelated cloud-based ASR system on a novel task. |
Prashant Serai; Peidong Wang; Eric Fosler-Lussier; | arxiv-cs.AI | 2024-08-20 |
32 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | acl | 2024-08-20 |
33 | Error-preserving Automatic Speech Recognition of Young English Learners' Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the mistakes made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their mistakes. |
JANICK MICHOT et. al. | acl | 2024-08-20 |
34 | CopyNE: Better Contextual ASR By Copying Named Entities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we treat entities as indivisible wholes and introduce the idea of copying into ASR. |
SHILIN ZHOU et. al. | acl | 2024-08-20 |
35 | XCB: An Effective Contextual Biasing Approach to Bias Cross-lingual Phrases in Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these models often struggle with bilingual settings, which are prevalent in code-switching speech recognition. In this study, we make the initial attempt to address this challenge by introducing a Cross-lingual Contextual Biasing(XCB) module. |
Xucheng Wan; Naijun Zheng; Kai Liu; Huan Zhou; | arxiv-cs.CL | 2024-08-20 |
36 | A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, during speech recognition in noisy environments, we observed the presence of illusions and repetition issues in audio-LLM, leading to substitution and insertion errors. This paper proposes a transcription prompt-based audio-LLM by introducing an ASR expert as a transcription tokenizer and a hybrid Autoregressive (AR) Non-autoregressive (NAR) decoding approach to solve the above problems. |
YANGZE LI et. al. | arxiv-cs.SD | 2024-08-18 |
37 | Enhancing Dialogue Speech Recognition with Robust Contextual Awareness Via Noise Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy. |
Wonjun Lee; San Kim; Gary Geunbae Lee; | arxiv-cs.CL | 2024-08-12 |
38 | Audio Enhancement for Computer Audition — An Iterative Training Paradigm Using Sample Importance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. |
Manuel Milling; Shuo Liu; Andreas Triantafyllopoulos; Ilhan Aslan; Björn W. Schuller; | arxiv-cs.SD | 2024-08-12 |
39 | LI-TTA: Language Informed Test-Time Adaptation for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, a key limitation of this self-supervision lies in its primary focus on acoustic features, with minimal attention to the linguistic properties of the input. To address this gap, we propose Language Informed Test-Time Adaptation (LI-TTA), which incorporates linguistic insights during TTA for ASR. |
Eunseop Yoon; Hee Suk Yoon; John Harvill; Mark Hasegawa-Johnson; Chang D. Yoo; | arxiv-cs.CL | 2024-08-11 |
40 | MooER: LLM-based Speech Recognition and Translation Models from Moore Threads Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present MooER, a LLM-based large-scale automatic speech recognition (ASR) / automatic speech translation (AST) model of Moore Threads. |
JUNHAO XU et. al. | arxiv-cs.CL | 2024-08-09 |
41 | Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present accent clustering and mining schemes for fair speech recognition systems which can perform equally well on under-represented accented speech. |
JAEYOUNG KIM et. al. | arxiv-cs.SD | 2024-08-05 |
42 | ALIF: Low-Cost Adversarial Audio Attacks on Black-Box Speech Platforms Using Linguistic Features Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, AE-based adversarial audio samples are susceptible to ASR updates. In this paper, we identify the root cause of these limitations, namely the inability to construct AE attack samples directly around the decision boundary of deep learning (DL) models. |
PENG CHENG et. al. | arxiv-cs.CR | 2024-08-03 |
43 | MECOS: A Bilingual Manipuri-English Spontaneous Code-switching Speech Corpus for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View |
Naorem Karline Singh; Y. J. Chanu; Hoomexsun Pangsatabam; | Comput. Speech Lang. | 2024-08-01 |
44 | On The Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use the comparison of five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training. |
Nick Rossenbach; Ralf Schlüter; Sakriani Sakti; | arxiv-cs.CL | 2024-07-31 |
45 | Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. |
KOHEI MATSUURA et. al. | arxiv-cs.CL | 2024-07-31 |
46 | On The Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we evaluate the utility of synthetic data for training automatic speech recognition (ASR). |
Nick Rossenbach; Benedikt Hilmes; Ralf Schlüter; | arxiv-cs.CL | 2024-07-25 |
47 | Improving Domain-Specific ASR with LLM-Generated Contextual Descriptions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terminology. To address this problem, we propose a method to utilize the state-of-the-art Whisper without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. |
Jiwon Suh; Injae Na; Woohwan Jung; | arxiv-cs.CL | 2024-07-25 |
48 | Coupling Speech Encoders with Downstream Text Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a modular approach to building cascade speech translation (AST) models that guarantees that the resulting model performs no worse than the 1-best cascade baseline while preserving state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task. |
Ciprian Chelba; Johan Schalkwyk; | arxiv-cs.CL | 2024-07-24 |
49 | Quantifying The Role of Textual Predictability in Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use this method to demonstrate that a Wav2Vec 2.0-based model makes stronger use of textual context than a hybrid ASR model, in spite of not using an explicit language model, and also use it to shed light on recent results demonstrating poor performance of standard ASR systems on African-American English. We demonstrate that these mostly represent failures of acoustic–phonetic modelling. |
Sean Robertson; Gerald Penn; Ewan Dunbar; | arxiv-cs.CL | 2024-07-23 |
50 | Evolutionary Prompt Design for LLM-Based Post-ASR Error Correction Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Building upon the strength of modern large language models (LLMs), generative error correction (GEC) has emerged as a promising paradigm that can elevate the performance of modern … |
Rithik Sachdev; Zhong-Qiu Wang; Chao-Han Huck Yang; | arxiv-cs.CL | 2024-07-23 |
51 | DMel: Speech Tokenization Made Simple Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Using a transformer decoder-only architecture for speech-text modeling, we comprehensively evaluate different speech tokenization methods on speech recognition (ASR) and speech synthesis (TTS). |
HE BAI et. al. | arxiv-cs.CL | 2024-07-22 |
52 | SELM: Enhancing Speech Emotion Recognition for Out-of-Domain Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As an instance of this approach, we present SELM, an audio-conditioned language model for SER that predicts different emotion views. |
Hazim Bukhari; Soham Deshmukh; Hira Dhamyal; Bhiksha Raj; Rita Singh; | arxiv-cs.SD | 2024-07-21 |
53 | Low-Resourced Speech Recognition for Iu Mien Language Via Weakly-Supervised Phoneme-based Multilingual Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With less than 10 hours of transcribed Iu Mien language, this paper investigates and compares the three approaches for Iu Mien speech recognition. |
LUKUAN DONG et. al. | arxiv-cs.SD | 2024-07-18 |
54 | Reexamining Racial Disparities in Automatic Speech Recognition Performance: The Role of Confounding By Provenance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Automatic speech recognition (ASR) models trained on large amounts of audio data are now widely used to convert speech to written text in a variety of applications from video captioning to automated assistants used in healthcare and other domains. |
Changye Li; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-07-18 |
55 | Beyond Binary: Multiclass Paraphasia Detection with Generative Pretrained Transformers and End-to-End Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present novel approaches that use a generative pretrained transformer (GPT) to identify paraphasias from transcripts as well as two end-to-end approaches that focus on modeling both automatic speech recognition (ASR) and paraphasia classification as multiple sequences vs. a single sequence. |
Matthew Perez; Aneesha Sampath; Minxue Niu; Emily Mower Provost; | arxiv-cs.CL | 2024-07-15 |
56 | Textless Dependency Parsing By Labeled Sequence Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Although their effectiveness is shown in capturing acoustic features, it is unclear in capturing lexical knowledge. This paper proposes a textless method for dependency parsing, examining its effectiveness and limitations. |
Shunsuke Kando; Yusuke Miyao; Jason Naradowsky; Shinnosuke Takamichi; | arxiv-cs.CL | 2024-07-14 |
57 | CUSIDE-T: Chunking, Simulating Future and Decoding for Transducer Based Streaming ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present CUSIDE-T, which successfully adapts the CUSIDE method over the recurrent neural network transducer (RNN-T) ASR architecture, instead of being based on the CTC architecture. |
Wenbo Zhao; Ziwei Li; Chuan Yu; Zhijian Ou; | arxiv-cs.SD | 2024-07-14 |
58 | Empowering Whisper As A Joint Multi-Talker and Target-Talker Speech Recognition System Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a pioneering approach to empower Whisper, which is a speech foundation model, to tackle joint multi-talker and target-talker speech recognition tasks. |
LINGWEI MENG et. al. | arxiv-cs.SD | 2024-07-13 |
59 | HebDB: A Weakly Supervised Dataset for Hebrew Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. |
ARNON TURETZKY et. al. | arxiv-cs.CL | 2024-07-10 |
60 | LearnerVoice: A Dataset of Non-Native English Learners’ Spontaneous Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our linguistic analysis reveals that transcriptions in our dataset contain L2S (L2 learner’s Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), significantly more than native speech datasets. |
HAECHAN KIM et. al. | arxiv-cs.CL | 2024-07-05 |
61 | Controlling Whisper: Universal Acoustic Adversarial Attacks to Control Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With the development of audio-prompted LLMs there is the potential for even greater control options. In this work we demonstrate that with this greater flexibility the systems can be susceptible to model-control adversarial attacks. |
Vyas Raina; Mark Gales; | arxiv-cs.SD | 2024-07-05 |
62 | TokenVerse: Unifying Speech and NLP Tasks Via Transducer-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our paper introduces TokenVerse, a single Transducer-based model designed to handle multiple tasks. |
SHASHI KUMAR et. al. | arxiv-cs.CL | 2024-07-05 |
63 | Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This study yields numerous significant findings that we are discussing in this paper. |
Salima Mdhaffar; Haroun Elleuch; Fethi Bougares; Yannick Estève; | arxiv-cs.CL | 2024-07-05 |
64 | Romanization Encoding For Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce romanization encoding for script-heavy languages to optimize multilingual and code-switching Automatic Speech Recognition (ASR) systems. |
WEN DING et. al. | arxiv-cs.CL | 2024-07-05 |
65 | Improving Accented Speech Recognition Using Data Augmentation Based on Unsupervised Text-to-Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. |
Cong-Thanh Do; Shuhei Imai; Rama Doddipatla; Thomas Hain; | arxiv-cs.CL | 2024-07-04 |
66 | FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). |
KEYU AN et. al. | arxiv-cs.SD | 2024-07-04 |
67 | Improving Self-supervised Pre-training Using Accent-Specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose an accent-aware adaptation technique for self-supervised learning that introduces a trainable set of accent-specific codebooks to the self-supervised architecture. |
Darshan Prabhu; Abhishek Gupta; Omkar Nitsure; Preethi Jyothi; Sriram Ganapathy; | arxiv-cs.CL | 2024-07-04 |
68 | Finetuning End-to-End Models for Estonian Conversational Spoken Language Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. |
Tiia Sildam; Andra Velve; Tanel Alumäe; | arxiv-cs.CL | 2024-07-04 |
69 | Qifusion-Net: Layer-adapted Stream/Non-stream Model for End-to-End Multi-Accent Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. |
Jinming Chen; Jingyi Fang; Yuanzhong Zheng; Yaoxuan Wang; Haojun Fei; | arxiv-cs.SD | 2024-07-03 |
70 | Pinyin Regularization in Error Correction for Chinese Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Subsequently, we conduct a preliminary evaluation using the dataset for both direct-prompting and fine-tuning pre-trained LLMs. |
Zhiyuan Tang; Dong Wang; Shen Huang; Shidong Shang; | arxiv-cs.CL | 2024-07-01 |
71 | Less Is More: Accurate Speech Recognition & Translation Without Web-Scale Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We argue that state-of-the-art accuracy can be reached without relying on web-scale data. |
KRISHNA C. PUVVADA et. al. | arxiv-cs.CL | 2024-06-28 |
72 | Zero-Query Adversarial Attack on Black-box Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ZQ-Attack, a transfer-based adversarial attack on ASR systems in the zero-query black-box setting. |
ZHENG FANG et. al. | arxiv-cs.CR | 2024-06-27 |
73 | Enhanced ASR Robustness to Packet Loss with A Front-End Adaptation Network Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose using a front-end adaptation network connected to a frozen ASR model. |
Yehoshua Dissen; Shiry Yonash; Israel Cohen; Joseph Keshet; | arxiv-cs.SD | 2024-06-27 |
74 | ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. |
Ahmed Heakl; Youssef Zaghloul; Mennatullah Ali; Rania Hossam; Walid Gomaa; | arxiv-cs.CL | 2024-06-26 |
75 | Automatic Speech Recognition for Hindi Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations. |
Anish Saha; A. G. Ramakrishnan; | arxiv-cs.CL | 2024-06-26 |
76 | Streaming Decoder-Only Automatic Speech Recognition with Discrete Speech Units: A Pilot Study Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, we introduce a decoder-only model exclusively designed for streaming recognition, incorporating a dedicated boundary token to facilitate streaming recognition and employing causal attention masking during the training phase. |
Peikun Chen; Sining Sun; Changhao Shan; Qing Yang; Lei Xie; | arxiv-cs.SD | 2024-06-26 |
77 | Dynamic Data Pruning for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of data. |
QIAO XIAO et. al. | arxiv-cs.CL | 2024-06-26 |
78 | MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. |
ADRIANA FERNANDEZ-LOPEZ et. al. | arxiv-cs.CV | 2024-06-25 |
79 | A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors, especially in domain-mismatch conditions. This paper presents a comprehensive solution to address these issues. |
VAN TUNG PHAM et. al. | arxiv-cs.LG | 2024-06-25 |
80 | SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose SC-MoE, a Switch-Conformer-based mixture-of-experts (MoE) system for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR). We design a streaming MoE layer consisting of three language experts, corresponding to Mandarin, English, and blank, respectively, and equip the SC-MoE encoder with a language identification (LID) network, trained with a Connectionist Temporal Classification (CTC) loss, as a router to achieve a real-time streaming CS ASR system. |
Shuaishuai Ye; Shunfei Chen; Xinhui Hu; Xinkang Xu; | arxiv-cs.SD | 2024-06-25 |
81 | FASA: A Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: When generating datasets, human annotations are not scalable, and existing forced-alignment tools are not usable as they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children’s speech data from many of the existing noisy children’s speech data. |
Dancheng Liu; Jinjun Xiong; | arxiv-cs.CL | 2024-06-25 |
82 | Sequential Editing for Lifelong Training of Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. |
Devang Kulshreshtha; Saket Dingliwal; Brady Houston; Nikolaos Pappas; Srikanth Ronanki; | arxiv-cs.CL | 2024-06-25 |
83 | Blending LLMs Into Cascaded Speech Translation: KIT’s Offline Speech Translation System for IWSLT 2024 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present KIT’s offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. |
SAI KONERU et. al. | arxiv-cs.CL | 2024-06-24 |
84 | Exploring The Capability of Mamba in Speech Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we compared Mamba with state-of-the-art Transformer variants for various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. |
Koichi Miyazaki; Yoshiki Masuyama; Masato Murata; | arxiv-cs.SD | 2024-06-24 |
85 | Perception of Phonological Assimilation By Neural Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). |
Charlotte Pouw; Marianne de Heer Kloots; Afra Alishahi; Willem Zuidema; | arxiv-cs.CL | 2024-06-21 |
86 | Massive End-to-end Speech Recognition Models with Time Reduction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate massive end-to-end automatic speech recognition (ASR) models with efficiency improvements achieved by time reduction. |
WEIRAN WANG et. al. | naacl | 2024-06-20 |
87 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | naacl | 2024-06-20 |
88 | Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. |
Murali Karthick Baskar; Andrew Rosenberg; Bhuvana Ramabhadran; Neeraj Gaur; Zhong Meng; | arxiv-cs.AI | 2024-06-20 |
89 | Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a two-stage method, Contrastive and Consistency Learning (CCL), that correlates error patterns between clean and noisy ASR transcripts and emphasizes the consistency of the latent features of the two transcripts. |
Suyoung Kim; Jiyeon Hwang; Ho-Young Jung; | naacl | 2024-06-20 |
90 | Children’s Speech Recognition Through Discrete Token Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we investigate the integration of discrete speech tokens into children’s speech recognition systems as input without significantly degrading the ASR performance. |
Vrunda N. Sukhadia; Shammur Absar Chowdhury; | arxiv-cs.CL | 2024-06-19 |
91 | Joint Vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While traditional approaches take on these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks jointly while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. |
Alexander Blatt; Aravind Krishnan; Dietrich Klakow; | arxiv-cs.CL | 2024-06-19 |
92 | ManWav: The First Manchu ASR Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In a pioneering effort, we introduce the first-ever Manchu ASR model ManWav, leveraging Wav2Vec2-XLSR-53. |
Jean Seo; Minha Kang; Sungjoo Byun; Sangah Lee; | arxiv-cs.CL | 2024-06-19 |
93 | Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. |
Adrien Pupier; Maximin Coavoux; Jérôme Goulian; Benjamin Lecouteux; | arxiv-cs.CL | 2024-06-18 |
94 | Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. |
Hayato Futami; Siddhant Arora; Yosuke Kashiwagi; Emiru Tsunoo; Shinji Watanabe; | arxiv-cs.CL | 2024-06-18 |
95 | Bridging The Gap: Integrating Pre-trained Speech Enhancement and Recognition Models for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are independently trained. Therefore, this study introduces a simple yet effective SE post-processing technique to address the gap between various pre-trained SE and ASR models. |
KUAN-CHEN WANG et. al. | arxiv-cs.SD | 2024-06-18 |
96 | CoSTA: Code-Switched Speech Translation Using Aligned Speech-Text Interleaving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text. |
Bhavani Shankar; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2024-06-16 |
97 | Imperceptible Rhythm Backdoor Attacks: Exploring Rhythm Transformation for Embedding Undetectable Vulnerabilities on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve the stealthiness of data poisoning, we propose a non-neural and fast algorithm called Random Spectrogram Rhythm Transformation (RSRT) in this paper. |
Wenhan Yao; Jiangkun Yang; Yongqiang He; Jia Liu; Weiping Wen; | arxiv-cs.SD | 2024-06-16 |
98 | Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper’s cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. |
Haoyu Wang; Guoqiang Hu; Guodong Lin; Wei-Qiang Zhang; Jian Li; | arxiv-cs.SD | 2024-06-14 |
99 | An Efficient Text Augmentation Approach for Contextualized Mandarin Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although contextualized automatic speech recognition (ASR) systems are commonly used to improve the recognition of uncommon words, their effectiveness is hindered by the inherent limitations of speech-text data availability. To address this challenge, our study proposes to leverage extensive text-only datasets and contextualize pre-trained ASR models using a straightforward text-augmentation (TA) technique, all while keeping computational costs minimal. |
Naijun Zheng; Xucheng Wan; Kai Liu; Ziqing Du; Zhou Huan; | arxiv-cs.SD | 2024-06-14 |
100 | Speech ReaLLM — Real-time Streaming Speech Recognition with Multimodal LLMs By Teaching The Flow of Time Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Speech ReaLLM, a new ASR architecture that marries decoder-only ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. |
FRANK SEIDE et. al. | arxiv-cs.CL | 2024-06-13 |
102 | LASER: Learning By Aligning Self-supervised Representations of Speech for Improving Content-related Tasks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Recent attempts have been made to address this issue with cost-effective self-supervised fine-tuning (SSFT) approaches. Continuing in this direction, a cost-effective SSFT method named LASER: Learning by Aligning Self-supervised Representations is presented. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-06-13 |
103 | EffectiveASR: A Single-Step Non-Autoregressive Mandarin Speech Recognition Architecture with High Accuracy and Inference Speed Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a single-step NAR ASR architecture with high accuracy and inference speed, called EffectiveASR. |
ZIYANG ZHUANG et. al. | arxiv-cs.SD | 2024-06-13 |
104 | Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. |
Chihiro Taguchi; David Chiang; | arxiv-cs.CL | 2024-06-13 |
105 | Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a transcription-free method for joint training using only audio signals. |
WILLIAM RAVENSCROFT et. al. | arxiv-cs.SD | 2024-06-13 |
106 | Training Data Augmentation for Dysarthric Automatic Speech Recognition By Text-to-Dysarthric-Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Automatic speech recognition (ASR) research has achieved impressive performance in recent years and has significant potential for enabling access for people with dysarthria (PwD) in augmentative and alternative communication (AAC) and home environment systems. |
Wing-Zin Leung; Mattias Cross; Anton Ragni; Stefan Goetze; | arxiv-cs.SD | 2024-06-12 |
107 | Towards Unsupervised Speech Recognition Without Pronunciation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora by proposing the removal of reliance on a phoneme lexicon. |
JUNRUI NI et. al. | arxiv-cs.CL | 2024-06-12 |
108 | ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. |
JIATONG SHI et. al. | arxiv-cs.SD | 2024-06-12 |
109 | The Interspeech 2024 Challenge on Speech Processing Using Discrete Units Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in this evolving field. |
XUANKAI CHANG et. al. | arxiv-cs.SD | 2024-06-11 |
110 | PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. |
TRANG LE et. al. | arxiv-cs.CL | 2024-06-11 |
111 | AS-70: A Mandarin Stuttered Speech Dataset for Automatic Speech Recognition and Stuttering Event Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the efficacy of these models diminishes when applied to atypical speech, such as stuttering. This paper introduces AS-70, the first publicly available Mandarin stuttered speech dataset, which stands out as the largest dataset in its category. |
RONG GONG et. al. | arxiv-cs.SD | 2024-06-11 |
112 | Reading Miscue Detection in Primary School Through Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We found that Hubert Large finetuned on Dutch speech achieves SOTA phoneme-level child speech recognition (PER at 23.1%), while Whisper (Faster Whisper Large-v2) achieves SOTA word-level performance (WER at 9.8%). |
Lingyun Gao; Cristian Tejedor-Garcia; Helmer Strik; Catia Cucchiarini; | arxiv-cs.CL | 2024-06-11 |
113 | MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling Methods for Learning Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model’s capacity. |
Hemant Yadav; Sunayana Sitaram; Rajiv Ratn Shah; | arxiv-cs.CL | 2024-06-09 |
114 | Hypernetworks for Personalizing ASR to Atypical Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Parameter-efficient fine-tuning (PEFT) for personalizing automatic speech recognition (ASR) has recently shown promise for adapting general population models to atypical speech. |
Max Müller-Eberstein; Dianna Yee; Karren Yang; Gautam Varma Mantena; Colin Lea; | arxiv-cs.LG | 2024-06-06 |
115 | Improving Zero-Shot Chinese-English Code-Switching ASR with KNN-CTC and Gated Monolingual Datastores Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although there is potential for performance improvement, a kNN-CTC model utilizing a single bilingual datastore can inadvertently introduce undesirable noise from the alternative language. To address this, we propose a novel kNN-CTC-based code-switching ASR (CS-ASR) framework that employs dual monolingual datastores and a gated datastore selection mechanism to reduce noise interference. |
JIAMING ZHOU et. al. | arxiv-cs.CL | 2024-06-06 |
116 | BLSP-Emo: Towards Empathetic Large Speech-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generate empathetic responses. |
CHEN WANG et. al. | arxiv-cs.CL | 2024-06-06 |
117 | Text Injection for Neural Contextual Biasing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work proposes contextual text injection (CTI) to enhance contextual ASR. |
ZHONG MENG et. al. | arxiv-cs.CL | 2024-06-05 |
118 | Error-preserving Automatic Speech Recognition of Young English Learners’ Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To give corrective feedback, which is a crucial part of language learning, the ASR systems in our setting need to preserve the errors made by the language learners. In this work, we build an ASR system that satisfies these requirements: it works on spontaneous speech by young language learners and preserves their errors. |
JANICK MICHOT et. al. | arxiv-cs.CL | 2024-06-05 |
119 | Discrete Multimodal Transformers with A Pretrained Large Language Model for Mixed-Supervision Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). |
VIET ANH TRINH et. al. | arxiv-cs.CL | 2024-06-04 |
120 | Efficiently Train ASR Models That Memorize Less and Perform Better with Per-core Clipping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work systematically investigates the impact of a specific granularity of gradient clipping, namely per-core clipping (PCC), across training a wide range of ASR models. |
LUN WANG et. al. | arxiv-cs.CR | 2024-06-04 |
121 | Whistle: Data-Efficient Multilingual and Crosslingual Speech Recognition Via Weakly Phonetic Supervision Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper explores the approach of pre-training with weakly phonetic supervision towards data-efficient MCL-ASR, which is called Whistle. |
Saierdaer Yusuyin; Te Ma; Hao Huang; Wenbo Zhao; Zhijian Ou; | arxiv-cs.SD | 2024-06-04 |
122 | Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks, which typically feature a single transcript associated with hours-long audio. |
Ara Yeroyan; Nikolay Karpov; | arxiv-cs.CL | 2024-06-03 |
123 | Pass The Butter: A Study on Desktop-classic Multitasking Robotic Arm Based on Advanced YOLOv7 and BERT Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to meet the current societal demand for service robot technology, this study proposes using a miniaturized desktop-level robot (by ROS) as a carrier, locally deploying a natural language model (NLP-BERT), and integrating visual recognition (CV-YOLO) and speech recognition technology (ASR-Whisper) as inputs to achieve autonomous decision-making and rational action by the desktop robot. |
HAOHUA QUE et. al. | arxiv-cs.RO | 2024-05-27 |
124 | Denoising LM: Pushing The Limits of Error Correction Models for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present Denoising LM (DLM), which is a scaled error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. |
ZIJIN GU et. al. | arxiv-cs.LG | 2024-05-24 |
125 | Let’s Fuse Step By Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Generative Fusion Decoding (GFD), a novel shallow fusion framework, utilized to integrate Large Language Models (LLMs) into multi-modal text recognition systems such as automatic speech recognition (ASR) and optical character recognition (OCR). |
CHAN-JAN HSU et. al. | arxiv-cs.CL | 2024-05-23 |
126 | You Don’t Understand Me!: Comparing ASR Results for L1 and L2 Speakers of Swedish IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we focus on the gap in performance between recognition results for native and non-native, read and spontaneous, Swedish utterances transcribed by different ASR services. |
Ronald Cumbal; Birger Moell; Jose Lopes; Olof Engwall; | arxiv-cs.CL | 2024-05-22 |
127 | A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This prevents the human users to interrupt the robot, which limits speech-based human-robot interaction. To enable a more natural interaction which allows for such interruptions, we propose an audio processing pipeline for filtering out robot’s ego speech using only a single-channel microphone. |
Yue Li; Florian A. Kunneman; Koen V. Hindriks; | arxiv-cs.HC | 2024-05-22 |
128 | Non-autoregressive Real-time Accent Conversion Model with Voice Cloning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We have developed the non-autoregressive model for real-time accent conversion with voice cloning. |
Vladimir Nechaev; Sergey Kosyakov; | arxiv-cs.SD | 2024-05-21 |
129 | Listen Again and Choose The Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. |
YUCHEN HU et. al. | arxiv-cs.CL | 2024-05-16 |
130 | Towards Evaluating The Robustness of Automatic Speech Recognition Systems Via Audio Style Transfer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose an attack on ASR systems based on user-customized style transfer. |
WEIFEI JIN et. al. | arxiv-cs.SD | 2024-05-15 |
131 | Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a simple yet effective method to learn a universal acoustic realization of Whisper’s $\texttt{<|endoftext|>}$ token, which, when prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively `muting’ the model. |
Vyas Raina; Rao Ma; Charles McGhee; Kate Knill; Mark Gales; | arxiv-cs.CL | 2024-05-09 |
132 | Lost in Transcription: Identifying and Quantifying The Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. |
DENA MUJTABA et. al. | arxiv-cs.CL | 2024-05-09 |
133 | The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. |
JINGGUANG TIAN et. al. | arxiv-cs.SD | 2024-05-08 |
134 | Open Implementation and Study of BEST-RQ for Speech Processing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we describe a re-implementation of a Random-projection quantizer and perform a preliminary study with a comparison to wav2vec 2.0 on four downstream tasks. |
Ryan Whetten; Titouan Parcollet; Marco Dinarelli; Yannick Estève; | arxiv-cs.CL | 2024-05-07 |
135 | Mixat: A Data Set of Bilingual Emirati-English Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces Mixat: a dataset of Emirati speech code-mixed with English. |
Maryam Al Ali; Hanan Aldarmaki; | arxiv-cs.CL | 2024-05-04 |
136 | Unveiling The Potential of LLM-Based ASR on Chinese Open-Source Datasets Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. |
XUELONG GENG et. al. | arxiv-cs.SD | 2024-05-03 |
137 | Improving Membership Inference in ASR Model Auditing with Perturbed Loss Features Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. |
FRANCISCO TEIXEIRA et. al. | arxiv-cs.LG | 2024-05-02 |
138 | Low-resource Speech Recognition and Dialect Identification of Irish in A Multi-task Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). |
Liam Lonergan; Mengjie Qian; Neasa Ní Chiaráin; Christer Gobl; Ailbhe Ní Chasaide; | arxiv-cs.CL | 2024-05-02 |
139 | Efficient Compression of Multitask Multilingual Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. |
Thomas Palmeira Ferraz; | arxiv-cs.CL | 2024-05-01 |
140 | Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover, current methods require much time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called \textsc{After}, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. |
Dongyuan Li; Ying Zhang; Yusong Wang; Funakoshi Kotaro; Manabu Okumura; | arxiv-cs.SD | 2024-05-01 |
141 | Confides: A Visual Analytics Solution for Automated Speech Recognition Analysis and Exploration Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Confidence scores of automatic speech recognition (ASR) outputs are often inadequately communicated, preventing their seamless integration into analytical workflows. In this paper, we introduce ConFides, a visual analytic system developed in collaboration with intelligence analysts to address this issue. |
Sunwoo Ha; Chaehun Lim; R. Jordan Crouser; Alvitta Ottley; | arxiv-cs.HC | 2024-04-30 |
142 | Child Speech Recognition in Human-Robot Interaction: Problem Solved? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. |
RUBEN JANSSENS et. al. | arxiv-cs.CL | 2024-04-26 |
143 | Automatic Speech Recognition System-Independent Word Error Rate Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. |
Chanho Park; Mingjie Chen; Thomas Hain; | arxiv-cs.CL | 2024-04-25 |
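Many entries above, including the WER estimation paper just listed, report results in terms of word error rate (WER). As general background (not the paper's system-independent estimation method), WER is the word-level Levenshtein edit distance between reference and hypothesis, normalized by reference length; a minimal sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution or match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One substitution ("the" -> "a") over six reference words: WER = 1/6
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is one reason estimating it reliably without references is non-trivial.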
144 | Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. |
Chihiro Taguchi; Jefferson Saransig; Dayana Velásquez; David Chiang; | arxiv-cs.CL | 2024-04-23 |
145 | Semantically Corrected Amharic Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. |
Samuael Adnew; Paul Pu Liang; | arxiv-cs.CL | 2024-04-20 |
146 | Jointly Recognizing Speech and Singing Voices Based on Multi-Task Audio Source Separation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. |
Ye Bai; Chenxing Li; Hao Li; Yuanyuan Zhao; Xiaorui Wang; | arxiv-cs.SD | 2024-04-17 |
147 | Generalization of Self-Supervised Learning-Based Representations for Cross-Domain Speech Emotion Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Self-supervised learning (SSL) from unlabelled speech data has revolutionized speech representation learning. Among them, wavLM, wav2vec2, HuBERT, and Data2vec have produced … |
Abinay Reddy Naini; Mary A. Kohler; Elizabeth Richerson; Donita Robinson; Carlos Busso; | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
148 | Extending Large Language Models for Speech and Audio Captioning Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multimodal large language models (LLMs) have shown promising visual perception abilities by connecting with image encoders, but their performance on auditory tasks has not yet … |
CHANGLI TANG et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2024-04-14 |
149 | Automatic Speech Recognition Advancements for Indigenous Languages of The Americas Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods. |
Monica Romero; Sandra Gomez; Ivan G. Torre; | arxiv-cs.CL | 2024-04-12 |
150 | An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. |
Tien-Hong Lo; Fu-An Chao; Tzu-I Wu; Yao-Ting Sung; Berlin Chen; | arxiv-cs.SD | 2024-04-11 |
151 | VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in The Medical Domain Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present VietMed – a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. |
Khai Le-Duc; | arxiv-cs.CL | 2024-04-08 |
152 | Mai Ho’omāuna I Ka ‘Ai: Language Models Improve Automatic Speech Recognition in Hawaiian Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we address the challenge of improving Automatic Speech Recognition (ASR) for a low-resource language, Hawaiian, by incorporating large amounts of independent text data into an ASR foundation model, Whisper. |
Kaavya Chaparala; Guido Zarrella; Bruce Torres Fischer; Larry Kimura; Oiwi Parker Jones; | arxiv-cs.CL | 2024-04-03 |
153 | BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose BRAVEn, an extension to the recent RAVEn method, which learns speech representations entirely from raw audio-visual data. |
Alexandros Haliassos; Andreas Zinonos; Rodrigo Mira; Stavros Petridis; Maja Pantic; | arxiv-cs.CV | 2024-04-02 |
154 | Noise Masking Attacks and Defenses for Pretrained Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: They show that when a record has been seen at training time, the model will transcribe the noisy record with its memorized sensitive transcript. In our work, we extend these attacks beyond ASR models, to attack pretrained speech encoders. |
Matthew Jagielski; Om Thakkar; Lun Wang; | arxiv-cs.LG | 2024-04-02 |
155 | Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose Emotion Neural Transducer for fine-grained speech emotion recognition with automatic speech recognition (ASR) joint training. |
Siyuan Shen; Yu Gao; Feng Liu; Hanyang Wang; Aimin Zhou; | arxiv-cs.SD | 2024-03-28 |
156 | Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. |
YASH JAIN et. al. | arxiv-cs.CL | 2024-03-28 |
157 | DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information that helps mitigate phonetic confusion for NEC on ASR transcription. |
Yi-Cheng Wang; Hsin-Wei Wang; Bi-Cheng Yan; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-03-26 |
158 | More Than Words: Advancements and Challenges in Speech Recognition for Singing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. |
Anna Kruspe; | arxiv-cs.SD | 2024-03-14 |
159 | A Review on Gujarati Language Based Automatic Speech Recognition (ASR) Systems Related Papers Related Patents Related Grants Related Venues Related Experts View |
Mohit Dua; Bhavesh Bhagat; Shelza Dua; N. Chakravarty; | Int. J. Speech Technol. | 2024-03-12 |
160 | Automatic Speech Recognition (ASR) for The Diagnosis of Pronunciation of Speech Sound Disorders in Korean Children Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a model of automatic speech recognition (ASR) designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. |
TAEKYUNG AHN et. al. | arxiv-cs.CL | 2024-03-12 |
161 | SpeechColab Leaderboard: An Open-Source Platform for Automatic Speech Recognition Evaluation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. |
Jiayu Du; Jinpeng Li; Guoguo Chen; Wei-Qiang Zhang; | arxiv-cs.CL | 2024-03-12 |
162 | Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with ARN (attentive recurrent network) time-domain and CrossNet time-frequency domain enhancement models. |
Yufeng Yang; Ashutosh Pandey; DeLiang Wang; | arxiv-cs.SD | 2024-03-10 |
163 | SCORE: Self-supervised Correspondence Fine-tuning for Improved Content Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. |
Amit Meghanani; Thomas Hain; | arxiv-cs.CL | 2024-03-10 |
164 | A New Benchmark for Evaluating Automatic Speech Recognition in The Arabic Call Domain Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our work aims to establish a robust benchmark that not only encompasses the broad spectrum of Arabic dialects but also emulates the real-world conditions of call-based communications. |
QUSAI ABO OBAIDAH et. al. | arxiv-cs.AI | 2024-03-07 |
165 | Kirigami Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Audio-based human activity recognition (HAR) is very popular because many human activities have unique sound signatures that can be detected using machine learning (ML) … |
Sudershan Boovaraghavan; Haozhe Zhou; Mayank Goel; Yuvraj Agarwal; | Proceedings of the ACM on Interactive, Mobile, Wearable and … | 2024-03-06 |
166 | PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) … |
Joonas Kalda; Clément Pagés; R. Marxer; Tanel Alumäe; Hervé Bredin; | The Speaker and Language Recognition Workshop | 2024-03-04 |
167 | Automatic Speech Recognition Using Advanced Deep Learning Approaches: A Survey Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This survey offers a comprehensive review of DTL, FL, and RL-based ASR frameworks, aiming to provide insights into the latest developments and aid researchers and professionals in understanding the current challenges. |
Hamza Kheddar; Mustapha Hemis; Yassine Himeur; | arxiv-cs.SD | 2024-03-02 |
168 | Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions. |
Heyang Liu; Yu Wang; Yanfeng Wang; | arxiv-cs.CL | 2024-03-01 |
169 | Towards Inclusive Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View |
Siyuan Feng; B. Halpern; O. Kudina; O. Scharenborg; | Comput. Speech Lang. | 2024-03-01 |
170 | Probing The Information Encoded in Neural-based Acoustic Models of Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Following much research on neural network interpretability, we propose in this article a protocol that aims to determine what information is located in an ASR acoustic model (AM), and where. |
Quentin Raymondaud; Mickael Rouvier; Richard Dufour; | arxiv-cs.SD | 2024-02-29 |
171 | Inappropriate Pause Detection In Dysarthric Speech Using Large-Scale Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose task design, labeling strategy, and a speech recognition model with an inappropriate pause prediction layer. |
Jeehyun Lee; Yerin Choi; Tae-Jin Song; Myoung-Wan Koo; | arxiv-cs.CL | 2024-02-29 |
172 | Exploration of Adapter for Noise Robust Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study thoroughly investigates adapter-based ASR adaptation in noisy environments. |
Hao Shi; Tatsuya Kawahara; | arxiv-cs.SD | 2024-02-28 |
173 | Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. |
Giuseppe Attanasio; Beatrice Savoldi; Dennis Fucci; Dirk Hovy; | arxiv-cs.CL | 2024-02-27 |
174 | Large Language Models Are Efficient Learners of Noise-Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The latest work proposes a GER benchmark with the HyPoradise dataset to learn the mapping from ASR N-best hypotheses to ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity on noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is introducing noise information as a conditioner into the LLM. |
YUCHEN HU et. al. | iclr | 2024-02-26 |
175 | An Effective Mixture-Of-Experts Approach For Code-Switching Speech Recognition Leveraging Encoder Disentanglement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus exclusively on improving the acoustic encoder of E2E ASR to tackle the challenge caused by the code-switching phenomenon. |
Tzu-Ting Yang; Hsin-Wei Wang; Yi-Cheng Wang; Chi-Han Lin; Berlin Chen; | arxiv-cs.CL | 2024-02-26 |
176 | It’s Never Too Late: Fusing Acoustic Information Into Large Language Models for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, despite its effectiveness, GER introduces extra data uncertainty since the LLM is trained without taking into account acoustic information available in the speech signal. In this work, we aim to overcome such a limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). |
CHEN CHEN et. al. | iclr | 2024-02-26 |
177 | LipVoicer: Generating Speech from Silent Videos Guided By Lip Reading Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we present LipVoicer, a novel method that generates high-quality speech, even for in-the-wild and rich datasets, by incorporating the text modality. |
Yochai Yemini; Aviv Shamsian; Lior Bracha; Sharon Gannot; Ethan Fetaya; | iclr | 2024-02-26 |
178 | Not All Weights Are Created Equal: Enhancing Energy Efficiency in On-Device Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study delves into how weight parameters in speech recognition models influence the overall power consumption of these models. We discovered that the impact of weight parameters on power consumption varies, influenced by factors including how often they are invoked and their placement in memory. |
YANG LI et. al. | arxiv-cs.SD | 2024-02-20 |
179 | OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). |
Yifan Peng; Yui Sudo; Muhammad Shakeel; Shinji Watanabe; | arxiv-cs.CL | 2024-02-19 |
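OWSM-CTC above is built on Connectionist Temporal Classification (CTC). As generic background rather than a description of that model's decoder, greedy CTC decoding collapses consecutive repeated labels and then removes blanks; a minimal sketch (the blank symbol `_` is an illustrative choice):

```python
def ctc_collapse(tokens, blank="_"):
    """Greedy CTC decoding rule: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for t in tokens:
        # Keep a token only if it differs from its predecessor and is not blank
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return "".join(out)

# Repeats are merged, blanks separate genuine double letters ("ll")
print(ctc_collapse(list("hh_e_ll_lo_")))  # -> "hello"
```

The blank label is what lets CTC represent true consecutive duplicates: `a_a` decodes to `aa`, while `aa` decodes to `a`.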
180 | An Embarrassingly Simple Approach for LLM with Strong ASR Capacity IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we focus on solving one of the most important tasks in the field of speech processing, i.e., automatic speech recognition (ASR), with speech foundation encoders and large language models (LLM). |
ZIYANG MA et. al. | arxiv-cs.CL | 2024-02-13 |
181 | The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This research represents a pioneering effort in quantifying biases in the Portuguese language context through the application of MMS and Whisper, contributing to a better understanding of ASR systems’ performance in multilingual settings. |
Ajinkya Kulkarni; Anna Tokareva; Rameez Qureshi; Miguel Couceiro; | arxiv-cs.CL | 2024-02-12 |
182 | Self-consistent Context Aware Conformer Transducer for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel neural network architecture based on conformer transducer that adds contextual information flow to the ASR systems. |
Konstantin Kolokolov; Pavel Pekichev; Karthik Raghunathan; | arxiv-cs.CL | 2024-02-09 |
183 | Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), designed to generate coherent spoken responses with naturally occurring prosodic features relevant to the given input speech without relying on explicit automatic speech recognition (ASR) or text-to-speech (TTS) systems. |
HEESEUNG KIM et. al. | arxiv-cs.CL | 2024-02-08 |
184 | A Comprehensive Study of The Current State-of-the-Art in Nepali Automatic Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we examine the research conducted in the field of Nepali Automatic Speech Recognition (ASR). |
Rupak Raj Ghimire; Bal Krishna Bal; Prakash Poudyal; | arxiv-cs.SD | 2024-02-05 |
185 | Digits Micro-model for Accurate and Secure Transactions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present our work on creating micro models for multi-digit number recognition that handle diverse speaking styles reflecting real-world pronunciation patterns. |
Chirag Chhablani; Nikhita Sharma; Jordan Hosier; Vijay K. Gurbani; | arxiv-cs.LG | 2024-02-02 |
186 | Streaming Sequence Transduction Through Dynamic Compression Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce STAR (Stream Transduction with Anchor Representations), a novel Transformer-based model designed for efficient sequence-to-sequence transduction over streams. |
WEITING TAN et. al. | arxiv-cs.CL | 2024-02-02 |
187 | AccentFold: A Journey Through African Accents for Zero-Shot ASR Adaptation to Target Accents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While previous approaches have focused on modeling techniques or creating accented speech datasets, gathering sufficient data for the multitude of accents, particularly in the African context, remains impractical due to their sheer diversity and associated budget constraints. To address these challenges, we propose AccentFold, a method that exploits spatial relationships between learned accent embeddings to improve downstream Automatic Speech Recognition (ASR). |
Abraham Toluwase Owodunni; Aditya Yadavalli; Chris Chinenye Emezue; Tobi Olatunji; Clinton C Mbataku; | arxiv-cs.CL | 2024-02-02 |
188 | Exploring The Limits of Decoder-only Models Trained on Public Speech Recognition Corpora Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate factors such as choice of training datasets and modeling components necessary for obtaining the best performance using public English ASR corpora alone. |
Ankit Gupta; George Saon; Brian Kingsbury; | arxiv-cs.CL | 2024-01-31 |
189 | Improving ASR Performance with OCR Through Using Word Frequency Difference IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Recently, there has been a growing interest in conversational artificial intelligence (AI). As a result, research is actively being conducted on automatic speech recognition (ASR) … |
Kyudan Jung; Seungmin Bae; N. Kim; Hyun Gon Ryu; Hyuk-Jae Lee; | 2024 International Conference on Electronics, Information, … | 2024-01-28 |
190 | Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent research highlights the dependency of BPE subword tokenization’s efficacy on the morphological nature of the language, particularly in languages rich in inflectional morphology, where fewer BPE merges suffice for generating highly productive tokens. Motivated by this, our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity, thus enhancing out-of-distribution automatic speech recognition (ASR) performance. |
Ahnaf Mozib Samin; | arxiv-cs.CL | 2024-01-27 |
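For readers unfamiliar with byte pair encoding (BPE), the merge-learning loop that the Bengali ASR paper above tunes can be sketched generically. This toy corpus and the word frequencies are illustrative only, not drawn from the paper:

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merges from a {word: count} dict; returns the merge list."""
    # Start from character sequences with an end-of-word marker
    vocab = {tuple(w) + ("</w>",): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for syms, c in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere in the vocabulary
        new_vocab = {}
        for syms, c in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i < len(syms) - 1 and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges

print(bpe_merges({"low": 5, "lower": 2, "lowest": 2}, 2))
```

The number of merges is the hyperparameter the paper studies: for morphologically rich languages, fewer merges keep tokens closer to productive sub-word units.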
191 | Toward Practical Automatic Speech Recognition and Post-Processing: A Call for Explainable Error Benchmark Guideline Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. |
SEONMIN KOO et. al. | arxiv-cs.CL | 2024-01-25 |
192 | SpeechDPR: End-to-End Spoken Passage Retrieval for Open-Domain Spoken Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes the first known end-to-end framework, Speech Dense Passage Retriever (SpeechDPR), for the retrieval component of the openSQA problem. |
CHYI-JIUNN LIN et. al. | arxiv-cs.CL | 2024-01-24 |
193 | MF-AED-AEC: Speech Emotion Recognition By Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The prevalent approach in speech emotion recognition (SER) involves integrating both audio and textual information to comprehensively identify the speaker’s emotion, with the text … |
Jiajun He; Xiaohan Shi; Xingfeng Li; Tomoki Toda; | arxiv-cs.CL | 2024-01-24 |
194 | Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. |
W. RONNY HUANG et. al. | arxiv-cs.CL | 2024-01-23 |
195 | Keep Decoding Parallel with Effective Knowledge Distillation from Language Models to End-to-end Speech Recognisers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This study presents a novel approach for knowledge distillation (KD) from a BERT teacher model to an automatic speech recognition (ASR) model using intermediate layers. |
Michael Hentschel; Yuta Nishikawa; Tatsuya Komatsu; Yusuke Fujita; | arxiv-cs.CL | 2024-01-22 |
196 | Using Large Language Model for End-to-End Chinese ASR and NER Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This approach, however, has received less attention in the literature. In this work, we connect the Whisper encoder with ChatGLM3 and provide in-depth comparisons of these two approaches using Chinese automatic speech recognition (ASR) and named entity recognition (NER) tasks. |
YUANG LI et. al. | arxiv-cs.CL | 2024-01-20 |
197 | SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. |
Hao Wang; Shuhei Kurita; Shuichiro Shimizu; Daisuke Kawahara; | arxiv-cs.CV | 2024-01-18 |
198 | Joint Unsupervised and Supervised Training for Automatic Speech Recognition Via Bilevel Optimization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present a novel bilevel optimization-based training approach to training acoustic models for automatic speech recognition (ASR) tasks that we term {bi-level joint unsupervised and supervised training (BL-JUST)}. |
A F M SAIF et. al. | arxiv-cs.CL | 2024-01-13 |
199 | LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In contrast to rare phrase lists, the slides within videos are synchronized in real-time with the speech, enabling the extraction of long contextual bias. Therefore, we propose a novel long-context biasing network (LCB-net) for audio-visual speech recognition (AVSR) to leverage the long-context information available in videos effectively. |
Fan Yu; Haoxu Wang; Xian Shi; Shiliang Zhang; | arxiv-cs.SD | 2024-01-12 |
200 | XLS-R Deep Learning Model for Multilingual ASR on Low-Resource Languages: Indonesian, Javanese, and Sundanese Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This research paper focuses on the development and evaluation of Automatic Speech Recognition (ASR) technology using the XLS-R 300m model. The study aims to improve ASR … |
Panji Arisaputra; Alif Tri Handoyo; Amalia Zahra; | ArXiv | 2024-01-12 |
201 | UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction. |
JIAXIN GUO et. al. | arxiv-cs.CL | 2024-01-11 |
202 | Useful Blunders: Can Automated Speech Recognition Errors Improve Downstream Dementia Classification? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Objectives: We aimed to investigate how errors from automatic speech recognition (ASR) systems affect dementia classification accuracy, specifically in the “Cookie Theft” picture description task. |
Changye Li; Weizhe Xu; Trevor Cohen; Serguei Pakhomov; | arxiv-cs.CL | 2024-01-10 |
203 | High-precision Voice Search Query Correction Via Retrievable Speech-text Embeddings Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, ASR-hypothesis-based retrieval can yield poor precision if the textual hypotheses are too phonetically dissimilar to the transcript truth. In this paper, we eliminate the hypothesis-audio mismatch problem by querying the correction database directly using embeddings derived from the utterance audio; the embeddings of the utterance audio and candidate corrections are produced by multimodal speech-text embedding networks trained to place the embedding of the audio of an utterance and the embedding of its corresponding textual transcript close together. |
CHRISTOPHER LI et. al. | arxiv-cs.CL | 2024-01-08 |
204 | An Audio-quality-based Multi-strategy Approach for Target Speaker Extraction in The MISP 2023 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes our audio-quality-based multi-strategy approach for the audio-visual target speaker extraction (AVTSE) task in the Multi-modal Information based Speech Processing (MISP) 2023 Challenge. |
RUNDUO HAN et. al. | arxiv-cs.SD | 2024-01-08 |
205 | Cross-Speaker Encoding Network for Multi-Talker Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a Cross-Speaker Encoding (CSE) network to address the limitations of SIMO models by aggregating cross-speaker representations. |
JIAWEN KANG et. al. | arxiv-cs.SD | 2024-01-08 |
206 | ICMC-ASR: The ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To promote speech processing and recognition research in driving scenarios, we build on the success of the Intelligent Cockpit Speech Recognition Challenge (ICSRC) held at ISCSLP 2022 and launch the ICASSP 2024 In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge. |
HE WANG et. al. | arxiv-cs.SD | 2024-01-07 |
207 | MLCA-AVSR: Multi-Layer Cross Attention Fusion Based Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, current studies mainly focus on fusing the well-learned modality features, like the output of modality-specific encoders, without considering the contextual relationship during the modality feature learning. In this study, we propose a multi-layer cross-attention fusion based AVSR (MLCA-AVSR) approach that promotes representation learning of each modality by fusing them at different levels of audio/visual encoders. |
He Wang; Pengcheng Guo; Pan Zhou; Lei Xie; | arxiv-cs.SD | 2024-01-07 |
208 | Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we introduce a method that utilizes the ASR system’s lattice output instead of relying solely on the top hypothesis, aiming to encapsulate speech ambiguities and enhance SLU outcomes. |
KEVIN EVERSON et. al. | arxiv-cs.CL | 2024-01-05 |
209 | Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an automatic speech recognition (ASR) model to hallucination at test time, which does not require access to the training dataset. |
Rita Frieske; Bertram E. Shi; | arxiv-cs.CL | 2024-01-03 |
210 | Arabic Speech Recognition: Advancement and Challenges Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speech recognition is a captivating process that revolutionizes human-computer interactions, allowing us to interact and control machines through spoken commands. The foundation … |
ASHIFUR RAHMAN et. al. | IEEE Access | 2024-01-01 |
211 | Waveform-Domain Speech Enhancement Using Spectrogram Encoding for Robust Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While waveform-domain speech enhancement (SE) has been extensively investigated in recent years and achieves state-of-the-art performance in many datasets, spectrogram-based SE … |
Hao Shi; M. Mimura; Tatsuya Kawahara; | IEEE/ACM Transactions on Audio, Speech, and Language … | 2024-01-01 |
212 | Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition Using Adversarial Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, this paper presents an extensive comparative study of various data augmentation approaches to improve the robustness of pre-trained ASR model fine-tuning to dysarthric speech. |
HUIMENG WANG et. al. | arxiv-cs.SD | 2023-12-31 |
213 | Accented Speech Recognition With Accent-specific Codebooks Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. |
Darshan Prabhu; Preethi Jyothi; Sriram Ganapathy; Vinit Unni; | emnlp | 2023-12-22 |
214 | Back Transcription As A Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. |
Marek Kubis; Paweł Skórzewski; Marcin Sowański; Tomasz Ziętkiewicz; | emnlp | 2023-12-22 |
215 | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). |
SRIJITH RADHAKRISHNAN et. al. | emnlp | 2023-12-22 |
216 | CS2W: A Chinese Spoken-to-Written Style Conversion Dataset with Multiple Conversion Types Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, the availability of datasets for this is limited. To address this issue, we present CS2W, a Chinese Spoken-to-Written style conversion dataset comprising 7,237 spoken sentences extracted from transcribed conversational texts. |
Zishan Guo; Linhao Yu; Minghui Xu; Renren Jin; Deyi Xiong; | emnlp | 2023-12-22 |
217 | Speech Recognition and Meaning Interpretation: Towards Disambiguation of Structurally Ambiguous Spoken Utterances in Indonesian Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we attempt to resolve structurally ambiguous utterances into unambiguous texts in Indonesian using prosodic information. |
RUHIYAH WIDIAPUTRI et. al. | emnlp | 2023-12-22 |
218 | KEBAP: Korean Error Explainable Benchmark Dataset for ASR and Post-processing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional evaluation metrics for ASR systems produce a singular aggregate score, which is insufficient for understanding specific system vulnerabilities. Therefore, we aim to address the limitations of the previous ASR evaluation methods by introducing the Korean Error Explainable Benchmark Dataset for ASR and Post-processing (KEBAP). |
SEONMIN KOO et. al. | emnlp | 2023-12-22 |
219 | CLAD-ST: Contrastive Learning with Adversarial Data for Robust Speech Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address this robustness problem in downstream MT models by forcing the MT encoder to bring the representations of a noisy input closer to its clean version in the semantic space. This is achieved by introducing a contrastive learning method that leverages adversarial examples in the form of ASR outputs paired with their corresponding human transcripts to optimize the network parameters. |
Sathish Indurthi; Shamil Chollampatt; Ravi Agrawal; Marco Turchi; | emnlp | 2023-12-22 |
220 | Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an approach, that builds on a pre-trained ASR model and extends it with an adaptive upstream module, that fuses audio and visual information. |
Christopher Simic; Tobias Bocklet; | arxiv-cs.SD | 2023-12-21 |
221 | Multimodal Attention Merging for Improved Speech Recognition and Audio Event Classification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In cases where some data/compute is available, we present Learnable-MAM, a data-driven approach to merging attention matrices, resulting in a further 2.90% relative reduction in WER for ASR and 18.42% relative reduction in AEC compared to fine-tuning. |
ANIRUDH S. SUNDAR et. al. | arxiv-cs.LG | 2023-12-21 |
222 | KNN-CTC: Enhancing ASR Via Retrieval of CTC Pseudo Labels Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The success of retrieval-augmented language models in various natural language processing (NLP) tasks has been constrained in automatic speech recognition (ASR) applications due to challenges in constructing fine-grained audio-text datastores. This paper presents kNN-CTC, a novel approach that overcomes these challenges by leveraging Connectionist Temporal Classification (CTC) pseudo labels to establish frame-level audio-text key-value pairs, circumventing the need for precise ground truth alignments. |
JIAMING ZHOU et. al. | arxiv-cs.SD | 2023-12-20 |
223 | SpokesBiz — An Open Corpus of Conversational Polish Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We outline the general structure and content of the corpus, showcasing selected applications in linguistic research, evaluation and improvement of automatic speech recognition (ASR) systems. |
PIOTR PĘZIK et. al. | arxiv-cs.CL | 2023-12-19 |
224 | Seq2seq for Automatic Paraphasia Detection in Aphasic Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a novel, sequence-to-sequence (seq2seq) model that is trained end-to-end (E2E) to perform both ASR and paraphasia detection tasks. |
MATTHEW PEREZ et. al. | arxiv-cs.SD | 2023-12-16 |
225 | Parameter-Efficient Cross-Language Transfer Learning for A Language-Modular Audiovisual Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In audiovisual speech recognition (AV-ASR), for many languages only little audiovisual data is available. Building upon an English model, in this work, we first apply and analyze … |
ZHENGYANG LI et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
226 | Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Transfer learning from large multilingual pretrained models, like XLSR, has become the new paradigm for Automatic Speech Recognition (ASR). Considering their ever-increasing size, … |
GEOFFROY VANDERREYDT et. al. | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
227 | Conformer-Based Speech Recognition On Extreme Edge-Computing Devices Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a series of model architecture adaptions, neural network graph transformations, and numerical optimizations to fit an advanced Conformer based end-to-end streaming ASR system on resource-constrained devices without accuracy degradation. |
MINGBIN XU et. al. | arxiv-cs.LG | 2023-12-16 |
228 | Leveraging The Multilingual Indonesian Ethnic Languages Dataset In Self-Supervised Models for Low-Resource ASR Task Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Indonesia is home to roughly 700 languages, which amounts to about ten percent of the global total, positioning it as the second-most linguistically diverse country after Papua … |
S. Sakti; Benita Angela Titalim; | 2023 IEEE Automatic Speech Recognition and Understanding … | 2023-12-16 |
229 | LiteVSR: Efficient Visual Speech Recognition By Learning from Speech Representations of Unlabeled Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. |
HENDRIK LAUX et. al. | arxiv-cs.CV | 2023-12-15 |
230 | Automatic Channel Selection and Spatial Feature Integration for Multi-channel Speech Recognition Across Various Array Topologies Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The focal point of the CHiME-7 Distant ASR task is to devise a unified system capable of generalizing various array topologies that have multiple recording devices and offering reliable recognition performance in real-world environments. Addressing this task, we introduce an ASR system that demonstrates exceptional performance across various array topologies. |
BINGSHEN MU et. al. | arxiv-cs.SD | 2023-12-15 |
231 | On The Compression of Shallow Non-causal ASR Models Using Knowledge Distillation and Tied-and-reduced Decoder for Low-latency On-device Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose shallow cascaded model by combining various model compression techniques such as knowledge distillation, shared decoder, and tied-and-reduced transducer network in order to reduce the model footprint. |
NAGARAJ ADIGA et. al. | arxiv-cs.SD | 2023-12-15 |
232 | Leveraging Language ID to Calculate Intermediate CTC Loss for Enhanced Code-Switching Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Most past studies have simplified the learning complexity of the model by splitting the code-switching task into multiple tasks dealing with a single language and then learning the domain-specific knowledge of each language separately. Therefore, in this paper, we attempt to introduce language identification information into the middle layer of the ASR model’s encoder. |
Tzu-Ting Yang; Hsin-Wei Wang; Berlin Chen; | arxiv-cs.CL | 2023-12-15 |
233 | Extending Whisper with Prompt Tuning to Target-speaker ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work leverages prompt tuning, a parameter-efficient fine-tuning approach, to extend Whisper, a large-scale single-talker ASR model, to TS-ASR. |
Hao Ma; Zhiyuan Peng; Mingjie Shao; Jing Li; Ju Liu; | arxiv-cs.CL | 2023-12-13 |
234 | ROSE: A Recognition-Oriented Speech Enhancement Framework in Air Traffic Control Using Multi-Objective Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, a time-domain recognition-oriented speech enhancement (ROSE) framework is proposed to improve speech intelligibility and also advance ASR accuracy based on convolutional encoder-decoder-based U-Net framework, which serves as a plug-and-play tool in ATC scenarios and does not require additional retraining of the ASR model. |
Xincheng Yu; Dongyue Guo; Jianwei Zhang; Yi Lin; | arxiv-cs.SD | 2023-12-10 |
235 | Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This research optimizes two-pass cross-lingual transfer learning in low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme translation models. |
Wonjun Lee; Gary Geunbae Lee; Yunsu Kim; | arxiv-cs.CL | 2023-12-06 |
236 | End-to-End Speech-to-Text Translation: A Survey Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As a result, researchers have been exploring end-to-end (E2E) models for ST translation. |
Nivedita Sethiya; Chandresh Kumar Maurya; | arxiv-cs.CL | 2023-12-02 |
237 | FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. |
Dongning Yang; Wei Wang; Yanmin Qian; | arxiv-cs.SD | 2023-11-29 |
238 | End-to-end Joint Rich and Normalized ASR with A Limited Amount of Rich Training Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we compare two different approaches to train a stateless Transducer-based E2E joint rich and normalized ASR system, ready for streaming applications, with a limited amount of rich labeled data. |
Can Cui; Imran Ahamad Sheikh; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.CL | 2023-11-29 |
239 | On The Effectiveness of ASR Representations in Real-world Noisy Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but are limited to non-stationary noises in real-world environments due to their complexity and uncertainty. To overcome this limitation, we introduce a new method for NSER by adopting the automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. |
Xiaohan Shi; Jiajun He; Xingfeng Li; Tomoki Toda; | arxiv-cs.SD | 2023-11-13 |
240 | Decoupling and Interacting Multi-Task Learning Network for Joint Speech and Accent Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Decoupling and Interacting Multi-task Network (DIMNet) for joint speech and accent recognition, which is comprised of a connectionist temporal classification (CTC) branch, an AR branch, an ASR branch, and a bottom feature encoder. |
Qijie Shao; Pengcheng Guo; Jinghao Yan; Pengfei Hu; Lei Xie; | arxiv-cs.SD | 2023-11-12 |
241 | A Survey of Technologies for Automatic Dysarthric Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View |
Zhaopeng Qian; K. Xiao; Chongchong Yu; | EURASIP Journal on Audio, Speech, and Music Processing | 2023-11-11 |
242 | Loss Masking Is Not Needed in Decoder-only Transformer for Discrete-token-based ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose to model speech tokens in an autoregressive way, similar to text. |
QIAN CHEN et. al. | arxiv-cs.CL | 2023-11-08 |
243 | Improved Child Text-to-Speech Synthesis Through Fastpitch-based Transfer Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents a novel approach that leverages the Fastpitch text-to-speech (TTS) model for generating high-quality synthetic child speech. |
Rishabh Jain; Peter Corcoran; | arxiv-cs.SD | 2023-11-07 |
244 | Pseudo-Labeling for Domain-Agnostic Bangla Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we propose a pseudo-labeling approach to develop a large-scale domain-agnostic ASR dataset. |
RABINDRA NATH NANDI et. al. | arxiv-cs.CL | 2023-11-06 |
245 | COSMIC: Data Efficient Instruction-tuning For Speech In-Context Learning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a cost-effective method to integrate speech into a large language model (LLM), resulting in a Contextual Speech Model with Instruction-following/in-context-learning Capabilities (COSMIC) multi-modal LLM. |
JING PAN et. al. | arxiv-cs.CL | 2023-11-03 |
246 | Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models Via Language-Specific Experts Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we propose DistilWhisper, an approach able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. |
Thomas Palmeira Ferraz; Marcely Zanon Boito; Caroline Brun; Vassilina Nikoulina; | arxiv-cs.CL | 2023-11-02 |
247 | Disordered Speech Recognition Considering Low Resources and Abnormal Articulation Related Papers Related Patents Related Grants Related Venues Related Experts View |
Yuqin Lin; Longbiao Wang; Jianwu Dang; Sheng Li; Chenchen Ding; | Speech Commun. | 2023-11-01 |
248 | Learning Adapters for Code-Switching Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multilingual code-switching speech recognition has been an emerging research direction in real-world applications since most of speakers are bilingual or multilingual. A … |
Chun-Yi He; Jen-Tzung Chien; | 2023 Asia Pacific Signal and Information Processing … | 2023-10-31 |
249 | MUST: A Multilingual Student-Teacher Learning Approach for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, the aforementioned limitation is addressed by proposing a MUltilingual Student-Teacher (MUST) learning which exploits a posteriors mapping approach. |
Muhammad Umar Farooq; Rehan Ahmad; Thomas Hain; | arxiv-cs.CL | 2023-10-28 |
250 | MADGF: Multi-Agent Data Generation Framework Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Automatic Speech Recognition (ASR) systems predominantly cater to monolingual inputs and struggle with the complexity introduced by mixed language audio. In this paper, we present a novel Multi-Agent Data Generation Framework (MADGF) to address this challenge. |
Peng Xie; Kani Chen; | arxiv-cs.SD | 2023-10-27 |
251 | Uncovering Bias in ASR Systems: Evaluating Wav2vec2 and Whisper for Dutch Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: It is crucial that ASR systems can handle the wide range of variations in speech of speakers from different demographic groups, with different speaking styles, and of speakers … |
Márcio Fuckner; Sophie Horsman; Pascal Wiggers; Iskaj Janssen; | 2023 International Conference on Speech Technology and … | 2023-10-25 |
252 | ArTST: Arabic Text and Speech Transformer Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present ArTST, a pre-trained Arabic text and speech transformer for supporting open-source speech technologies for the Arabic language. |
Hawau Olamide Toyin; Amirbek Djanibekov; Ajinkya Kulkarni; Hanan Aldarmaki; | arxiv-cs.CL | 2023-10-25 |
253 | Back Transcription As A Method for Evaluating Robustness of Natural Language Understanding Models to Speech Recognition Errors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper proposes a method for investigating the impact of speech recognition errors on the performance of natural language understanding models. |
Marek Kubis; Paweł Skórzewski; Marcin Sowański; Tomasz Ziętkiewicz; | arxiv-cs.CL | 2023-10-25 |
254 | CDSD: Chinese Dysarthria Speech Database Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the Chinese Dysarthria Speech Database (CDSD) as a valuable resource for dysarthria research. |
MENGYI SUN et. al. | arxiv-cs.SD | 2023-10-24 |
255 | How Much Context Does My Attention-Based ASR System Need? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we conduct an empirical study on the effect of scaling the sequence length used to train/evaluate (dense-attention-based) acoustic models on speech recognition performance. |
Robert Flynn; Anton Ragni; | arxiv-cs.CL | 2023-10-24 |
256 | Hypotheses Paradise: An Open and Strong Baseline for Speech Recognition with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. |
CHEN CHEN et. al. | nips | 2023-10-24 |
257 | Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a streaming Transformer-Transducer (T-T) model able to jointly produce many-to-one and one-to-many transcription and translation using a single decoder. |
SARA PAPI et. al. | arxiv-cs.CL | 2023-10-23 |
258 | Intuitive Multilingual Audio-Visual Speech Recognition with A Single-Trained Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model on a multilingual dataset. |
Joanna Hong; Se Jin Park; Yong Man Ro; | arxiv-cs.MM | 2023-10-23 |
259 | Conversational Speech Recognition By Learning Audio-textual Cross-modal Contextual Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to irrelevant content, error propagation, and redundancy, existing methods struggle to extract longer and more effective contexts. To address this issue, we introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation. |
KUN WEI et. al. | arxiv-cs.SD | 2023-10-22 |
260 | BUT CHiME-7 System Description Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes the joint effort of Brno University of Technology (BUT), AGH University of Krakow and University of Buenos Aires on the development of Automatic Speech Recognition systems for the CHiME-7 Challenge. |
MARTIN KARAFIÁT et. al. | arxiv-cs.SD | 2023-10-18 |
261 | VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Due to the linguistic diversity and variations, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (ASR) of Arabic. |
Abdul Waheed; Bashar Talafha; Peter Sullivan; AbdelRahim Elmadany; Muhammad Abdul-Mageed; | arxiv-cs.CL | 2023-10-17 |
262 | Generative Error Correction for Code-switching Speech Recognition Using Large Language Models Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence. Despite the recent advances in automatic speech recognition (ASR), … |
CHEN CHEN et. al. | ArXiv | 2023-10-17 |
263 | Correction Focused Language Model Training for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce a novel correction focused LM training approach which aims to prioritize ASR fallible words. |
Yingyi Ma; Zhe Liu; Ozlem Kalinli; | arxiv-cs.CL | 2023-10-17 |
264 | Multi-stage Large Language Model Correction for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. |
Jie Pu; Thai-Son Nguyen; Sebastian Stüker; | arxiv-cs.CL | 2023-10-17 |
265 | End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame crosschannel attention and a speaker-attributed Transformer-based decoder. |
Can Cui; Imran Ahamad Sheikh; Mostafa Sadeghi; Emmanuel Vincent; | arxiv-cs.CL | 2023-10-16 |
266 | Personalization of CTC-based End-to-End Speech Recognition Using Pronunciation-Driven Subword Tokenization Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we describe our personalization solution for an end-to-end speech recognition system based on connectionist temporal classification. |
ZHIHONG LEI et. al. | arxiv-cs.LG | 2023-10-15 |
267 | Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing, leveraging the power of deep learning models to accurately deliver spot-on transcriptions across a wide variety of vocabularies and speaking styles. |
Ankitha Sudarshan; Vinay Samuel; Parth Patwa; Ibtihel Amara; Aman Chadha; | arxiv-cs.CL | 2023-10-14 |
268 | SALM: Speech-augmented Language Model with In-context Learning for Speech Recognition and Translation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel Speech Augmented Language Model (SALM) with {\em multitask} and {\em in-context} learning capabilities. |
ZHEHUAI CHEN et. al. | arxiv-cs.CL | 2023-10-13 |
269 | On The Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. |
Nick Rossenbach; Benedikt Hilmes; Ralf Schlüter; | arxiv-cs.CL | 2023-10-12 |
270 | Adapting The Adapters for Code-switching in Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. |
Atharva Kulkarni; Ajinkya Kulkarni; Miguel Couceiro; Hanan Aldarmaki; | arxiv-cs.CL | 2023-10-11 |
271 | Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR). |
SRIJITH RADHAKRISHNAN et. al. | arxiv-cs.CL | 2023-10-10 |
272 | No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While in the context of hybrid ASR models several solutions have been proposed, the gender bias issue has not been explicitly addressed in end-to-end neural architectures. To fill this gap, we propose a data augmentation technique that manipulates the fundamental frequency (f0) and formants. |
Dennis Fucci; Marco Gaido; Matteo Negri; Mauro Cettolo; Luisa Bentivogli; | arxiv-cs.CL | 2023-10-10 |
273 | Acoustic Model Fusion for End-to-end Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Drawing inspiration from the concept of LM fusion, we propose the integration of an external AM into the E2E system to better address the domain mismatch. |
ZHIHONG LEI et. al. | arxiv-cs.SD | 2023-10-10 |
274 | Ed-cec: Improving Rare Word Recognition Using Asr Postprocessing Based on Error Detection and Context-aware Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Automatic speech recognition (ASR) systems often encounter difficulties in accurately recognizing rare words, leading to errors that can have a negative impact on downstream tasks such as keyword spotting, intent detection, and text summarization. To address this challenge, we present a novel ASR postprocessing method that focuses on improving the recognition of rare words through error detection and context-aware error correction. |
Jiajun He; Zekun Yang; Tomoki Toda; | arxiv-cs.AI | 2023-10-08 |
275 | Improving End-to-End Speech Processing By Efficient Text Data Utilization with Latent Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. |
JIANQIAO LU et. al. | arxiv-cs.CL | 2023-10-08 |
276 | LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. |
ZHIHAO DU et. al. | arxiv-cs.SD | 2023-10-06 |
277 | Dementia Assessment Using Mandarin Speech with An Attention-based Speech Recognition Encoder Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper utilizes a speech recognition model to construct a dementia assessment system tailored for Mandarin speakers during the picture description task. |
ZIH-JYUN LIN et. al. | arxiv-cs.CL | 2023-10-05 |
278 | EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. |
Tejes Srivastava; Jiatong Shi; William Chen; Shinji Watanabe; | arxiv-cs.SD | 2023-10-05 |
279 | LibriSpeech-PC: Benchmark for Evaluation of Punctuation and Capitalization Capabilities of End-to-end ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a LibriSpeech-PC benchmark designed to assess the punctuation and capitalization prediction capabilities of end-to-end ASR models. |
ALEKSANDR MEISTER et. al. | arxiv-cs.CL | 2023-10-04 |
280 | Unsupervised Speech Recognition with N-Skipgram and Positional Unigram Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Training unsupervised speech recognition systems presents challenges due to GAN-associated instability, misalignment between speech and text, and significant memory demands. To tackle these challenges, we introduce a novel ASR system, ESPUM. |
Liming Wang; Mark Hasegawa-Johnson; Chang D. Yoo; | arxiv-cs.CL | 2023-10-03 |
281 | Evaluating Speech Synthesis By Training Recognizers on Synthetic Speech Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Prior works focus on evaluating synthetic speech based on pre-trained speech recognition models, however, this can be limiting since this approach primarily measures speech intelligibility. In this paper, we propose an evaluation technique involving the training of an ASR model on synthetic speech and assessing its performance on real speech. |
DAREEN ALHARTHI et. al. | arxiv-cs.CL | 2023-10-01 |
282 | AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Several publications have highlighted racial bias with speech-to-text algorithms, and performance on minority accents lags significantly. |
TOBI OLATUNJI et. al. | arxiv-cs.CL | 2023-09-30 |
283 | AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. |
Andrew Rouditchenko; Ronan Collobert; Tatiana Likhomanenko; | arxiv-cs.LG | 2023-09-29 |
284 | Federated Learning with Differential Privacy for End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we aim to bridge this research gap by formulating an ASR benchmark for FL with DP and establishing the first baselines. |
MARTIN PELIKAN et. al. | arxiv-cs.LG | 2023-09-29 |
285 | SLM: Bridge The Thin Gap Between Speech and Text Foundation Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. |
MINGQIU WANG et. al. | arxiv-cs.CL | 2023-09-29 |
286 | The Gift of Feedback: Improving ASR Model Quality By Learning from User Corrections Through Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In the context of models trained on the server but deployed on edge devices, errors may result from the mismatch between server training data and actual on-device usage. In this work, we seek to continually learn from on-device user corrections through Federated Learning (FL) to address this issue. |
LILLIAN ZHOU et. al. | arxiv-cs.CL | 2023-09-29 |
287 | LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this information may be helpful for ASR modeling. To alleviate this issue, we propose the LAE-ST-MoE framework. |
GUODONG MA et. al. | arxiv-cs.SD | 2023-09-28 |
288 | Speech Collage: Code-switched Audio Generation By Collaging Monolingual Corpora Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address data scarcity, this paper introduces Speech Collage, a method that synthesizes CS data from monolingual corpora by splicing audio segments. |
AMIR HUSSEIN et. al. | arxiv-cs.SD | 2023-09-27 |
289 | Lip2Vec: Efficient and Robust Visual Speech Recognition Via Latent-to-Latent Visual to Audio Representation Mapping Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec, that is based on learning a prior model. |
Yasser Abdelaziz Dahou Djilali; Sanath Narayan; Haithem Boussaid; Ebtessam Almazrouei; Merouane Debbah; | iccv | 2023-09-27 |
290 | HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. |
CHEN CHEN et. al. | arxiv-cs.CL | 2023-09-27 |
291 | Unsupervised Pre-Training for Vietnamese Automatic Speech Recognition in The HYKIST Project Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In today’s interconnected world, moving abroad is more and more prevalent, whether it’s for employment, refugee resettlement, or other causes. Language difficulties between … |
Khai Le-Duc; | ArXiv | 2023-09-26 |
292 | Updated Corpora and Benchmarks for Long-Form Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we re-release three standard ASR corpora – TED-LIUM 3, GigaSpeech, and VoxPopuli-en – with updated transcription and alignments to enable their use for long-form ASR research. |
JENNIFER DREXLER FOX et. al. | arxiv-cs.CL | 2023-09-26 |
293 | AudioFool: Fast, Universal and Synchronization-free Cross-Domain Attack on Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recent research has focused on exploring methods to create such attacks; however, some issues relating to Over-The-Air (OTA) attacks have not been properly addressed. In our work, we examine the needed properties of robust attacks compatible with the OTA model, and we design a method of generating attacks with arbitrary such desired properties, namely invariance to synchronization and robustness to filtering; this allows a Denial-of-Service (DoS) attack against ASR systems. |
Mohamad Fakih; Rouwaida Kanj; Fadi Kurdahi; Mohammed E. Fouda; | arxiv-cs.CR | 2023-09-20 |
294 | Directional Source Separation for Robust Speech Recognition on Smart Glasses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To improve voice quality, this work investigates directional source separation using the multi-microphone array. |
TIANTIAN FENG et. al. | arxiv-cs.SD | 2023-09-19 |
295 | Instruction-Following Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the mechanisms behind these models’ speech understanding and reasoning capabilities remain underexplored. To study this question from the data perspective, we introduce instruction-following speech recognition, training a Listen-Attend-Spell model to understand and execute a diverse set of free-form text instructions. |
Cheng-I Jeff Lai; Zhiyun Lu; Liangliang Cao; Ruoming Pang; | arxiv-cs.CL | 2023-09-18 |
296 | HypR: A Comprehensive Study for ASR Hypothesis Revising with A Reference Corpus Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Accordingly, we first concentrate on providing an ASR hypothesis revising (HypR) dataset in this study. |
Yi-Wei Wang; Ke-Han Lu; Kuan-Yu Chen; | arxiv-cs.CL | 2023-09-18 |
297 | Are Soft Prompts Good Zero-shot Learners for Speech Recognition? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, not many people understand how and why this is so. In this study, we aim to deepen our understanding of this emerging method by investigating the role of soft prompts in automatic speech recognition (ASR). |
DIANWEN NG et. al. | arxiv-cs.SD | 2023-09-17 |
298 | Augmenting Conformers with Structured State-space Sequence Models for Online Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. |
HAOZHE SHAN et. al. | arxiv-cs.CL | 2023-09-15 |
299 | Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, constituting a substantial portion of the model size and contributing significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. |
YANG LI et. al. | arxiv-cs.LG | 2023-09-14 |
300 | Echotune: A Modular Extractor Leveraging The Variable-Length Nature of Speech in ASR Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Historically, many approaches have leaned on fixed-length attention windows, which becomes problematic for speech samples that vary in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. Addressing this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. |
Sizhou Chen; Songyang Gao; Sen Fang; | arxiv-cs.SD | 2023-09-14 |
301 | CPPF: A Contextual and Post-processing-free Model for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we focus on ASR-related processing tasks, including Contextual ASR and multiple ASR post processing tasks. |
LEI ZHANG et. al. | arxiv-cs.CL | 2023-09-13 |
302 | SlideSpeech: A Large Scale Slide-Enriched Audio-Visual Corpus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches … |
HAOXU WANG et. al. | ICASSP 2024 – 2024 IEEE International Conference on … | 2023-09-11 |
303 | SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. |
HAOXU WANG et. al. | arxiv-cs.SD | 2023-09-11 |
304 | Leveraging Large Language Models for Exploiting ASR Uncertainty Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. |
PRANAY DIGHE et. al. | arxiv-cs.CL | 2023-09-09 |
305 | Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study, therefore, examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as Transformer-Transducer and contextual block streaming ASR. |
Huaibo Zhao; Yosuke Higuchi; Yusuke Kida; Tetsuji Ogawa; Tetsunori Kobayashi; | arxiv-cs.SD | 2023-09-08 |
306 | LanSER: Language-Model Supported Speech Emotion Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present LanSER, a method that enables the use of unlabeled data by inferring weak emotion labels via pre-trained large language models through weakly-supervised learning. |
TAESIK GONG et. al. | arxiv-cs.CL | 2023-09-07 |
307 | Bring The Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel method to extract the denoising capabilities, that can be applied to any encoder-decoder architecture. |
Patrick Eickhoff; Matthias Möller; Theresa Pekarek Rosin; Johannes Twiefel; Stefan Wermter; | arxiv-cs.CL | 2023-09-05 |
308 | SememeASR: Boosting Performance of End-to-End Speech Recognition Against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Considering that knowledge-driven approaches can help data-driven approaches alleviate their flaws, we introduce sememe-based semantic knowledge information to speech recognition (SememeASR). |
Jiaxu Zhu; Changhe Song; Zhiyong Wu; Helen Meng; | arxiv-cs.SD | 2023-09-04 |
309 | Text-Only Domain Adaptation for End-to-End Speech Recognition Through Down-Sampling Acoustic Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel representation-matching strategy that down-samples acoustic representations to align them with the text modality. |
JIAXU ZHU et. al. | arxiv-cs.SD | 2023-09-04 |
310 | Boosting Low-Resource Speech Recognition in Air Traffic Communication Via Pretrained Feature Aggregation and Multi-Task Learning IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Developing a robust Automatic Speech Recognition (ASR) system usually requires a large amount of well-annotated samples which is extremely hard to build in the Air Traffic Control … |
Dongyue Guo; Zichen Zhang; Bo Yang; Jianwei Zhang; Yi Lin; | IEEE Transactions on Circuits and Systems II: Express Briefs | 2023-09-01 |
311 | ASTER: Automatic Speech Recognition System Accessibility Testing for Stutterers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the challenge, we propose ASTER, a technique for automatically testing the accessibility of ASR systems. |
YI LIU et. al. | arxiv-cs.SD | 2023-08-29 |
312 | Speech Wikimedia: A 77 Language Multilingual Speech Dataset Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: The Speech Wikimedia Dataset is a publicly available compilation of audio with transcriptions extracted from Wikimedia Commons. It includes 1780 hours (195 GB) of CC-BY-SA … |
RAFAEL MOSQUERA GÓMEZ et. al. | arxiv-cs.AI | 2023-08-29 |
313 | NAaLoss: Rethinking The Objective of Speech Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Hence, in this study, we suggest a Noise- and Artifacts-aware loss function, NAaLoss, to ameliorate the influence of artifacts from a novel perspective. |
Kuan-Hsun Ho; En-Lun Yu; Jeih-weih Hung; Berlin Chen; | arxiv-cs.SD | 2023-08-24 |
314 | Cross-Modal Global Interaction and Local Alignment for Audio-Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a cross-modal global interaction and local alignment (GILA) approach for AVSR, which captures the deep audio-visual (A-V) correlations from both global and local perspectives. |
YUCHEN HU et. al. | ijcai | 2023-08-23 |
315 | Convoifilter: A Case Study of Doing Cocktail Party Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an end-to-end model designed to improve automatic speech recognition (ASR) for a particular speaker in a crowded, noisy environment. |
Thai-Binh Nguyen; Alexander Waibel; | arxiv-cs.SD | 2023-08-22 |
316 | SeamlessM4T: Massively Multilingual & Multimodal Machine Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. |
SEAMLESS COMMUNICATION et. al. | arxiv-cs.CL | 2023-08-22 |
317 | On Training A Neural Residual Acoustic Echo Suppressor for Improved ASR Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Acoustic Echo Cancellation (AEC) is critical for accurate recognition of speech directed at a smart device playing audio. Previous work has shown that neural AEC models can … |
S. Panchapagesan; T. Shabestary; A. Narayanan; | Interspeech | 2023-08-20 |
318 | A Conformer-based Classifier for Variable-length Utterance Processing in Anti-spoofing Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The success achieved by conformers in Automatic Speech Recognition (ASR) leads us to their application in other domains, such as spoofing detection for automatic speaker … |
Eros Rosello; Alejandro Gomez-Alanis; A. Gómez; A. Peinado; | Interspeech | 2023-08-20 |
319 | Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper addresses effective pretraining of automatic speech recognition (ASR) and gender recognition to improve wav2vec 2.0 embedding for speech emotion recognition (SER). … |
Yuan Gao; Chenhui Chu; Tatsuya Kawahara; | Interspeech | 2023-08-20 |
320 | Thai Dialect Corpus and Transfer-based Curriculum Learning Investigation for Dialect Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We release 840 hours of read speech multi-dialect ASR corpora consisting of 700 hours of main Thai dialect, named Thai-central, and 40 hours for each local dialect, named … |
Artit Suwanbandit; Burin Naowarat; Orathai Sangpetch; E. Chuangsuwanich; | Interspeech | 2023-08-20 |
321 | Embedding Articulatory Constraints for Low-resource Speech Recognition Based on Large Pre-trained Model Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Knowledge about phonemes and their articulatory attributes can help improve automatic speech recognition (ASR) of low-resource languages. In this study, we propose a simple and … |
Jaeyoung Lee; M. Mimura; Tatsuya Kawahara; | Interspeech | 2023-08-20 |
322 | Few-shot Dysarthric Speech Recognition with Text-to-Speech Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Speakers with dysarthria could particularly benefit from assistive speech technology, but are underserved by current automatic speech recognition (ASR) systems. The differences of … |
Enno Hermann; Mathew Magimai; | Interspeech | 2023-08-20 |
323 | MiniStreamer: Enhancing Small Conformer with Chunked-Context Masking for Streaming ASR Applications on The Edge Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Real-time applications of Automatic Speech Recognition (ASR) on user devices on the edge require streaming processing. Conformer model has achieved state-of-the-art performance in … |
Haris Gulzar; Monikka Roslianna Busto; Takeharu Eda; Katsutoshi Itoyama; K. Nakadai; | Interspeech | 2023-08-20 |
324 | Whisper Features for Dysarthric Severity-Level Classification Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Dysarthria is a speech disorder caused by improper coordination between the brain and the muscles that produce intelligible speech. Accurately diagnosing the severity of … |
Siddharth Rathod; Monil Charola; Akshat Vora; Yash Jogi; H. Patil; | Interspeech | 2023-08-20 |
325 | TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present TokenSplit, a speech separation model that acts on discrete token sequences. |
HAKAN ERDOGAN et. al. | arxiv-cs.SD | 2023-08-20 |
326 | Unsupervised Code-switched Text Generation from Parallel Text Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: There has been great interest in developing automatic speech recognition (ASR) systems that can handle code-switched (CS) speech to meet the needs of a growing bilingual … |
JI-EUN CHI et. al. | Interspeech | 2023-08-20 |
327 | Exploring Sources of Racial Bias in Automatic Speech Recognition Through The Lens of Rhythmic Variation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Although studies have shown that one issue of bias in modern automatic speech recognition (ASR) technologies is degraded performance for African American English (AAE) speakers, … |
Li-Fang Lai; N. Holliday; | Interspeech | 2023-08-20 |
328 | Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing Based Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View |
ZHENG LIANG et. al. | Interspeech | 2023-08-20 |
329 | Bayes Risk Transducer: Transducer with Controllable Alignment Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Specifically, this work proposes Bayes Risk Transducer (BRT), which uses a Bayes risk function to set lower risk values to the preferred paths so that the predicted alignment is more likely to satisfy specific desired properties. |
JINCHUAN TIAN et. al. | arxiv-cs.CL | 2023-08-19 |
330 | An Ambient Intelligence-based Approach For Longitudinal Monitoring of Verbal and Vocal Depression Symptoms Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Another major challenge in depression relapse research is the scarcity of publicly available datasets. To overcome these issues, we propose a one-shot learning framework for detecting depression relapse from speech. |
Alice Othmani; Muhammad Muzammel; | arxiv-cs.HC | 2023-08-16 |
331 | Accurate Synthesis of Dysarthric Speech for ASR Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a new dysarthric speech synthesis method for the purpose of ASR training data augmentation. |
Mohammad Soleymanpour; Michael T. Johnson; Rahim Soleymanpour; Jeffrey Berry; | arxiv-cs.SD | 2023-08-16 |
332 | Radio2Text: Streaming Speech Recognition Using MmWave Radio Signals Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. |
Running Zhao; Jiangtao Yu; Hang Zhao; Edith C. H. Ngai; | arxiv-cs.SD | 2023-08-15 |
333 | A Comprehensive Survey on Automatic Speech Recognition Using Neural Networks Related Papers Related Patents Related Grants Related Venues Related Experts View |
Amandeep Singh Dhanjal; Williamjeet Singh; | Multim. Tools Appl. | 2023-08-15 |
334 | Using Text Injection to Improve Recognition of Personal Identifiers in Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We use text-injection to improve the recognition of PII categories by including fake textual substitutes of PII categories in the training data using a text injection method. |
YOCHAI BLAU et. al. | arxiv-cs.CL | 2023-08-14 |
335 | Text Injection for Capitalization and Turn-Taking Prediction in Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we use joint end-to-end and internal language model training (JEIT) as our text injection algorithm to train an ASR model which performs two auxiliary tasks. |
SHAAN BIJWADIA et. al. | arxiv-cs.CL | 2023-08-14 |
336 | A Novel Self-training Approach for Low-resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a self-training approach for automatic speech recognition (ASR) for low-resource settings. |
Satwinder Singh; Feng Hou; Ruili Wang; | arxiv-cs.CL | 2023-08-09 |
337 | Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). |
Yang Zhang; Krishna C. Puvvada; Vitaly Lavrukhin; Boris Ginsburg; | arxiv-cs.SD | 2023-08-09 |
338 | Boosting Chinese ASR Error Correction with Dynamic Error Scaling Mechanism Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a novel approach that incorporates a dynamic error scaling mechanism to detect and correct phonetically erroneous text generated by ASR output. |
JIAXIN FAN et. al. | arxiv-cs.CL | 2023-08-07 |
339 | Federated Representation Learning for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints. |
GURUPRASAD V RAMESH et. al. | arxiv-cs.SD | 2023-08-03 |
340 | Inaudible Adversarial Perturbation: Manipulating The Recognition of User Speech in Real Time Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we seek to bridge the gap in existing research and extend the attack to user-present scenarios. |
XINFENG LI et. al. | arxiv-cs.CR | 2023-08-02 |
341 | ÌròyìnSpeech: A Multi-purpose Yorùbá Speech Corpus Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce ÌròyìnSpeech, a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data, which can be used for both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) tasks. |
Tolulope Ogunremi; Kola Tubosun; Anuoluwapo Aremu; Iroro Orife; David Ifeoluwa Adelani; | arxiv-cs.CL | 2023-07-29 |
342 | The Timing Bottleneck: Why Timing and Overlap Are Mission-critical for Conversational User Interfaces, Speech Recognition and Dialogue Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We find that word error rates for natural conversational data in 6 languages remain abysmal, and that overlap remains a key challenge (study 1). |
Andreas Liesenfeld; Alianda Lopez; Mark Dingemanse; | arxiv-cs.CL | 2023-07-28 |
343 | Modeling Spoken Information Queries for Virtual Assistants: Open Problems, Challenges and Opportunities Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We discuss open problems and challenges with respect to modeling spoken information queries for virtual assistants, and list opportunities where Information Retrieval methods and research can be applied to improve the quality of virtual assistant speech recognition. |
Christophe Van Gysel; | sigir | 2023-07-25 |
344 | Boosting Punctuation Restoration with Data Generation and Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While punctuated texts are abundant from written documents, the discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts. This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap. |
VIET DAC LAI et. al. | arxiv-cs.CL | 2023-07-24 |
345 | Adaptation of Whisper Models to Child Speech Recognition IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: Automatic Speech Recognition (ASR) systems often struggle with transcribing child speech due to the lack of large child speech datasets required to accurately train child-friendly … |
Rishabh Jain; Andrei Barcovschi; Mariam Yiwere; Peter Corcoran; H. Cucu; | ArXiv | 2023-07-24 |
346 | Code-Switched Urdu ASR for Noisy Telephonic Environment Using Data Centric Approach with Hybrid HMM and CNN-TDNN Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, this paper describes an implementation framework of a resource efficient Automatic Speech Recognition/ Speech to Text System in a noisy call-center environment using Chain Hybrid HMM and CNN-TDNN for Code-Switched Urdu Language. |
Muhammad Danyal Khan; Raheem Ali; Arshad Aziz; | arxiv-cs.CL | 2023-07-24 |
347 | Exploring The Integration of Speech Separation and Recognition with Self-Supervised Learning Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. |
YOSHIKI MASUYAMA et. al. | arxiv-cs.SD | 2023-07-23 |
348 | A Meta Learning Scheme for Fast Accent Domain Expansion in Mandarin Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce meta-learning techniques for fast accent domain expansion in mandarin speech recognition, which expands the field of accents without deteriorating the performance of mandarin ASR. |
Ziwei Zhu; Changhao Shan; Bihong Zhang; Jian Yu; | arxiv-cs.SD | 2023-07-23 |
349 | Robust Automatic Speech Recognition Via WavAugment Guided Phoneme Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Developing a practically-robust automatic speech recognition (ASR) is challenging since the model should not only maintain the original performance on clean samples, but also achieve consistent efficacy under small volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (wapat). |
GEGE QI et. al. | arxiv-cs.SD | 2023-07-23 |
350 | Modality Confidence Aware Training for Robust End-to-End Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To overcome this issue, we propose a novel E2E SLU system that enhances robustness to ASR errors by fusing audio and text representations based on the estimated modality confidence of ASR hypotheses. We introduce two novel techniques: 1) an effective method to encode the quality of ASR hypotheses and 2) an effective approach to integrate them into E2E SLU models. |
SUYOUN KIM et. al. | arxiv-cs.CL | 2023-07-22 |
351 | A Change of Heart: Improving Speech Emotion Recognition Through Speech-to-Text Modality Conversion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we introduce a modality conversion concept aimed at enhancing emotion recognition performance on the MELD dataset. |
Zeinab Sadat Taghavi; Ali Satvaty; Hossein Sameti; | arxiv-cs.SD | 2023-07-21 |
352 | A Deep Dive Into The Disparity of Word Error Rates Across Thousands of NPTEL MOOC Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we describe the curation of a massive speech dataset of 8740 hours consisting of $\sim9.8$K technical lectures in the English language along with their transcripts delivered by instructors representing various parts of Indian demography. |
Anand Kumar Rai; Siddharth D Jaiswal; Animesh Mukherjee; | arxiv-cs.CL | 2023-07-20 |
353 | Ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: We introduce ivrit.ai, a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) … |
Yanir Marmor; Kinneret Misgav; Y. Lifshitz; | ArXiv | 2023-07-17 |
354 | Replay to Remember: Continual Layer-Specific Fine-tuning for German Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To further increase the robustness of the ASR model to vocabulary and speakers outside of the fine-tuned domain, we apply Experience Replay for continual learning. |
Theresa Pekarek Rosin; Stefan Wermter; | arxiv-cs.CL | 2023-07-14 |
355 | SGGNet²: Speech-Scene Graph Grounding Network for Speech-guided Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel speech-scene graph grounding network (SGGNet²) that robustly grounds spoken utterances by leveraging the acoustic similarity between correctly recognized and misrecognized words obtained from automatic speech recognition (ASR) systems. |
DOHYUN KIM et. al. | arxiv-cs.RO | 2023-07-14 |
356 | SGGNet2: Speech-Scene Graph Grounding Network for Speech-guided Navigation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: The spoken language serves as an accessible and efficient interface, enabling non-experts and disabled users to interact with complex assistant robots. However, accurately … |
DOHYUN KIM et. al. | 2023 32nd IEEE International Conference on Robot and Human … | 2023-07-14 |
357 | Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We compare our model with encoders pretrained on self-supervised learning (SSL), and show that ASR pretraining is much more effective than SSL for SICSF. |
He Huang; Jagadeesh Balam; Boris Ginsburg; | arxiv-cs.CL | 2023-07-13 |
358 | Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a computation-efficient network named Language-Routing Mixture of Experts (LR-MoE) for multilingual and code-switching ASR. |
Wenxuan Wang; Guodong Ma; Yuke Li; Binbin Du; | arxiv-cs.SD | 2023-07-12 |
359 | Exploring The Integration of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. |
Zeping Min; Jinbo Wang; | arxiv-cs.CL | 2023-07-12 |
360 | SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. |
Titouan Parcollet; Rogier van Dalen; Shucong Zhang; Sourav Bhattacharya; | arxiv-cs.CL | 2023-07-12 |
361 | The NPU-MSXF Speech-to-Speech Translation System for IWSLT 2023 Speech-to-Speech Translation Task Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes the NPU-MSXF system for the IWSLT 2023 speech-to-speech translation (S2ST) task which aims to translate from English speech of multi-source to Chinese speech. |
KUN SONG et. al. | arxiv-cs.SD | 2023-07-10 |
362 | Introducing Semantics Into Speech Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a task-agnostic unsupervised way of incorporating semantic information from LLMs into self-supervised speech encoders without labeled audio transcriptions. |
DEREK XU et. al. | acl | 2023-07-08 |
363 | Building Accurate Low Latency ASR for Streaming Voice Search in E-commerce Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we build accurate LSTM, attention and CTC based streaming ASR models for large-scale Hinglish (blend of Hindi and English) Voice Search. |
Abhinav Goyal; Nikesh Garera; | acl | 2023-07-08 |
364 | SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. |
SUWON SHON et. al. | acl | 2023-07-08 |
365 | Hybrid Transducer and Attention Based Encoder-Decoder Modeling for Speech-to-Text Tasks IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. |
YUN TANG et. al. | acl | 2023-07-08 |
366 | DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Choosing an informative subset of speech samples that are most representative of the target accents becomes important for effective ASR finetuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn), which uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. |
SURAJ KOTHAWADE et. al. | acl | 2023-07-08 |
367 | BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems. |
MINGDA CHEN et. al. | acl | 2023-07-08 |
368 | STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present STT4SG-350, a corpus of Swiss German speech, annotated with Standard German text at the sentence level. |
MICHEL PLÜSS et. al. | acl | 2023-07-08 |
369 | A Theory of Unsupervised Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a general theoretical framework to study the properties of unsupervised ASR (ASR-U) systems based on random matrix theory and the theory of neural tangent kernels. |
Liming Wang; Mark Hasegawa-Johnson; Chang Yoo; | acl | 2023-07-08 |
370 | Why Aren't We NER Yet? Artifacts of ASR Errors in Named Entity Recognition in Spontaneous Speech Transcripts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we examine in detail the complex relationship between ASR and NER errors which limit the ability of NER models to recover entity mentions from spontaneous speech transcripts. |
PIOTR SZYMANSKI et. al. | acl | 2023-07-08 |
371 | Back Translation for Speech-to-text Translation Without Transcripts IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we aim to utilize large amounts of target-side monolingual data to enhance ST without transcripts. |
Qingkai Fang; Yang Feng; | acl | 2023-07-08 |
372 | Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we investigate whether data augmentation techniques could help improve low-resource ASR performance, focusing on four typologically diverse minority languages or language variants (West Germanic: Gronings, West-Frisian; Malayo-Polynesian: Besemah, Nasal). |
Martijn Bartelds; Nay San; Bradley McDonnell; Dan Jurafsky; Martijn Wieling; | acl | 2023-07-08 |
373 | Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To produce ASR and ST content effectively with minimal latency, we propose a joint token-level serialized output training method that interleaves source and target words by leveraging an off-the-shelf textual aligner. |
SARA PAPI et. al. | arxiv-cs.CL | 2023-07-06 |
374 | Transcribing Educational Videos Using Whisper: A Preliminary Study on Using AI for Transcribing Educational Videos Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Videos are increasingly being used for e-learning, and transcripts are vital to enhance the learning experience. The costs and delays of generating transcripts can be alleviated … |
Ashwin Rao; | ArXiv | 2023-07-04 |
375 | Knowledge-Aware Audio-Grounded Generative Slot Filling for Limited Annotated Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a Knowledge-Aware Audio-Grounded generative slot-filling framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for ToD with speech input. |
Guangzhi Sun; Chao Zhang; Ivan Vulić; Paweł Budzianowski; Philip C. Woodland; | arxiv-cs.CL | 2023-07-04 |
376 | Boosting Norwegian Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk. |
Javier de la Rosa; Rolv-Arild Braaten; Per Egil Kummervold; Freddy Wetjen; Svein Arne Brygfjeld; | arxiv-cs.CL | 2023-07-04 |
377 | Using Data Augmentations and VTLN to Reduce Bias in Dutch End-to-End Speech Recognition Systems Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we aim to reduce bias against different age groups and non-native speakers of Dutch. |
Tanvina Patel; Odette Scharenborg; | arxiv-cs.CL | 2023-07-04 |
378 | Multilingual Contextual Adapters To Improve Custom Word Recognition In Low-resource Languages Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a supervision loss for smoother training of the Contextual Adapters. |
Devang Kulshreshtha; Saket Dingliwal; Brady Houston; Sravan Bodapati; | arxiv-cs.CL | 2023-07-03 |
379 | Don’t Stop Self-Supervision: Accent Adaptation of Speech Representations Via Residual Adapters Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, such representations may be skewed toward canonical data characteristics of such corpora and perform poorly on atypical, non-native accented speaker populations. With the state-of-the-art HuBERT model as a baseline, we propose and investigate self-supervised adaptation of speech representations to such populations in a parameter-efficient way via training accent-specific residual adapters. |
ANSHU BHATIA et. al. | arxiv-cs.CL | 2023-07-01 |
380 | Trends and Developments in Automatic Speech Recognition Research Related Papers Related Patents Related Grants Related Venues Related Experts View |
D. O’Shaughnessy; | Comput. Speech Lang. | 2023-07-01 |
381 | Automatic Speech Recognition of Non-Native Child Speech for Language Learning Applications Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We sought to assess the performance of two state-of-the-art ASR systems, Wav2Vec2.0 and Whisper AI, with a view to developing a voicebot that can support children acquiring a foreign language. |
Simone Wills; Yu Bai; Cristian Tejedor-Garcia; Catia Cucchiarini; Helmer Strik; | arxiv-cs.CL | 2023-06-29 |
382 | Accelerating Transducers Through Adjacent Token Merging Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, this design is inefficient, particularly for long speech signals due to the quadratic computation of self-attention. To address this, we propose a new method, Adjacent Token Merging (A-ToMe), which gradually combines adjacent tokens with high similarity scores between their key values. |
Yuang Li; Yu Wu; Jinyu Li; Shujie Liu; | arxiv-cs.CL | 2023-06-28 |
383 | Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these approaches usually require a significant amount of target domain text data for the training of LMs. Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM). |
Yuang Li; Yu Wu; Jinyu Li; Shujie Liu; | arxiv-cs.CL | 2023-06-28 |
384 | Cascaded Encoders for Fine-tuning ASR Models on Overlapped Speech Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents an MT-ASR model formed by combining a well-trained foundation model with a multi-talker mask model in a cascaded RNN-T encoder configuration. |
Richard Rose; Oscar Chang; Olivier Siohan; | arxiv-cs.SD | 2023-06-28 |
385 | Don’t Be So Sure! Boosting ASR Decoding Via Confidence Relaxation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We perform a layer analysis to reveal and visualize how predictions evolve, and propose a decoding procedure that improves the performance of fine-tuned ASR models. |
Tomer Wullach; Shlomo E. Chazan; | aaai | 2023-06-26 |
386 | Complex Dynamic Neurons Improved Spiking Transformer Network for Efficient Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here we introduce four types of neuronal dynamics to post-process the sequential patterns generated from the spiking transformer to get the complex dynamic neuron improved spiking transformer neural network (DyTr-SNN). |
QINGYU WANG et. al. | aaai | 2023-06-26 |
387 | Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, different from audio forced alignment, it is challenging to develop a reliable visual forced alignment technology for the following two reasons: 1) Visual Speech Recognition (VSR) has a much lower performance compared to audio-based Automatic Speech Recognition (ASR), and 2) the translation from text to video is not reliable, so the method typically used for building audio forced alignment cannot be utilized in developing visual forced alignment. In order to alleviate these challenges, in this paper, we propose a new method that is appropriate for visual forced alignment, namely Deep Visual Forced Alignment (DVFA). |
Minsu Kim; Chae Won Kim; Yong Man Ro; | aaai | 2023-06-26 |
388 | Performance Disparities Between Accents in Automatic Speech Recognition (Student Abstract) Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this work, we expand the discussion of bias in Automatic Speech Recognition (ASR) through a large-scale audit. Using a large and global data set of speech, we perform an audit … |
Alex DiChristofano; Henry Shuster; Shefali Chandra; Neal Patwari; | AAAI Conference on Artificial Intelligence | 2023-06-26 |
389 | An Analysis of Personalized Speech Recognition System Development for The Deaf and Hard-of-Hearing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To do so, we analyze the use of openly-available automatic speech recognition (ASR) tools with a DHH Japanese speaker dataset. As these out-of-the-box ASR models typically do not perform well on DHH speech, we provide a thorough analysis of creating personalized ASR systems. |
Lester Phillip Violeta; Tomoki Toda; | arxiv-cs.SD | 2023-06-24 |
390 | Mixture Encoder for Joint Speech Separation and Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work proposes a middle-ground approach that leverages explicit speech separation similarly to the modular approach but also incorporates mixture speech information directly into the ASR module in order to mitigate the propagation of errors made by the speech separator. |
Simon Berger; Peter Vieting; Christoph Boeddeker; Ralf Schlüter; Reinhold Haeb-Umbach; | arxiv-cs.CL | 2023-06-21 |
391 | NoRefER: A Referenceless Quality Metric for Automatic Speech Recognition Via Semi-Supervised Language Model Fine-Tuning with Contrastive Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces NoRefER, a novel referenceless quality metric for automatic speech recognition (ASR) systems. |
Kamer Ali Yuksel; Thiago Ferreira; Golara Javadi; Mohamed El-Badrashiny; Ahmet Gunduz; | arxiv-cs.CL | 2023-06-21 |
392 | Multi-pass Training and Cross-information Fusion for Low-resource End-to-end Accented Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we propose a Conformer-based architecture, called Aformer, to leverage both the acoustic information from large non-accented and limited accented training data. |
Xuefei Wang; Yanhua Long; Yijie Li; Haoran Wei; | arxiv-cs.SD | 2023-06-20 |
393 | A Comparative Analysis of Automatic Speech Recognition Errors in Small Group Classroom Discourse IF:3 Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In collaborative learning environments, effective intelligent learning systems need to accurately analyze and understand the collaborative discourse between learners (i.e., group … |
JIE CAO et. al. | Proceedings of the 31st ACM Conference on User Modeling, … | 2023-06-18 |
394 | Research on An Improved Conformer End-to-end Speech Recognition Model with R-Drop Structure Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To address the issue of poor generalization ability in end-to-end speech recognition models within deep learning, this study proposes a new Conformer-based speech recognition model called Conformer-R that incorporates the R-drop structure. |
Weidong Ji; Shijie Zan; Guohui Zhou; Xu Wang; | arxiv-cs.SD | 2023-06-14 |
395 | Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing Based Data Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the current data augmentation methods mainly rely on audio splicing and text-to-speech (TTS) models, which might result in discontinuous, unrealistic, and less diversified speech. To mitigate these potential issues, we propose a novel data augmentation method by applying the text-based speech editing model. |
ZHENG LIANG et. al. | arxiv-cs.CL | 2023-06-14 |
396 | Learning Cross-lingual Mappings for Data Augmentation to Improve Low-Resource Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Recently, a novel multilingual model fusion technique has been proposed where a model is trained to learn cross-lingual acoustic-phonetic similarities as a mapping function. |
Muhammad Umar Farooq; Thomas Hain; | arxiv-cs.CL | 2023-06-14 |
397 | IIITH-CSTD Corpus: Crowdsourced Strategies for The Collection of A Large-scale Telugu Speech Corpus Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Due to the lack of a large annotated speech corpus, many low-resource Indian languages struggle to utilize recent advancements in deep neural network architectures for Automatic … |
MIRISHKAR SAI GANESH et. al. | ACM Transactions on Asian and Low-Resource Language … | 2023-06-12 |
398 | Multimodal Audio-textual Architecture for Robust Spoken Language Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Because such an approach relies on the ASR output, it often suffers from so-called ASR error propagation. In this work, we investigate impacts of this ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLM), such as BERT and RoBERTa. |
Anderson R. Avila; Mehdi Rezagholizadeh; Chao Xing; | arxiv-cs.CL | 2023-06-11 |
399 | Adversarial Training For Low-Resource Disfluency Correction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose an adversarially-trained sequence-tagging model for Disfluency Correction (DC) that utilizes a small amount of labeled real disfluent data in conjunction with a large amount of unlabeled data. |
Vineet Bhat; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2023-06-10 |
400 | Developing Speech Processing Pipelines for Police Accountability Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We investigate the potential of large pre-trained speech models for facilitating reviews, focusing on ASR and officer speech detection in footage from traffic stops. |
Anjalie Field; Prateek Verma; Nay San; Jennifer L. Eberhardt; Dan Jurafsky; | arxiv-cs.CL | 2023-06-09 |
401 | Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: End-to-end (E2E) systems have shown comparable performance to hybrid systems for automatic speech recognition (ASR). Word timings, as a by-product of ASR, are essential in many … |
Xianzhao Chen; Yist Y. Lin; Kang Wang; Yi He; Zejun Ma; | ArXiv | 2023-06-09 |
402 | Zambezi Voice: A Multilingual Speech Corpus for Zambian Languages Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. |
CLAYTONE SIKASOTE et. al. | arxiv-cs.CL | 2023-06-07 |
403 | An ASR-Based Tutor for Learning to Read: How to Optimize Feedback to First Graders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In a previous study, we presented an ASR-based Dutch reading tutor application that was developed to provide instantaneous feedback to first-graders learning to read. |
Yu Bai; Cristian Tejedor-Garcia; Ferdy Hubers; Catia Cucchiarini; Helmer Strik; | arxiv-cs.CL | 2023-06-07 |
404 | Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a new lenient evaluation metric as a more defensible CER measure for Japanese ASR. |
Shigeki Karita; Richard Sproat; Haruko Ishikawa; | arxiv-cs.CL | 2023-06-07 |
405 | Arabic Dysarthric Speech Recognition Using Adversarial and Signal-Based Augmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we aim to improve the performance of Arabic dysarthric automatic speech recognition through a multi-stage augmentation approach. |
Massa Baali; Ibrahim Almakky; Shady Shehata; Fakhri Karray; | arxiv-cs.SD | 2023-06-07 |
406 | Label Aware Speech Representation Learning For Language Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task. |
SHIKHAR VASHISHTH et. al. | arxiv-cs.CL | 2023-06-07 |
407 | A Study on The Impact of Self-Supervised Learning on Automatic Dysarthric Speech Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that HuBERT is the most versatile feature extractor across dysarthria classification, word recognition, and intelligibility classification, achieving +24.7%, +61%, and +7.2% accuracy, respectively, compared to classical acoustic features. |
Xavier F. Cadet; Ranya Aloufi; Sara Ahmadi-Abhari; Hamed Haddadi; | arxiv-cs.CL | 2023-06-07 |
408 | Alzheimer Disease Classification Through ASR-based Transcriptions: Exploring The Impact of Punctuation and Pauses Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we used the new state-of-the-art Automatic Speech Recognition (ASR) model Whisper to obtain the transcriptions, which also include automatic punctuation. |
LUCÍA GÓMEZ-ZARAGOZÁ et. al. | arxiv-cs.CL | 2023-06-06 |
409 | N-Shot Benchmarking of Whisper on Diverse Arabic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, it is not clear how Whisper would fare under diverse conditions even on languages it was evaluated on such as Arabic. In this work, we address this gap by comprehensively evaluating Whisper on several varieties of Arabic speech for the ASR task. |
Bashar Talafha; Abdul Waheed; Muhammad Abdul-Mageed; | arxiv-cs.CL | 2023-06-05 |
410 | End-to-End Joint Target and Non-Target Speakers ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel automatic speech recognition (ASR) system that can transcribe individual speaker’s speech while identifying whether they are target or non-target speakers from multi-talker overlapped speech. |
RYO MASUMURA et. al. | arxiv-cs.CL | 2023-06-04 |
411 | A Reference-Less Quality Metric for Automatic Speech Recognition Via Contrastive-Learning of A Multi-Language Model with Self-Supervision Summary Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Abstract: The common standard for quality evaluation of automatic speech recognition (ASR) systems is reference-based metrics such as the Word Error Rate (WER), computed using manual … |
K. Yuksel; Thiago Castro Ferreira; Ahmet Gunduz; Mohamed Al-Badrashiny; Golara Javadi; | 2023 IEEE International Conference on Acoustics, Speech, … | 2023-06-04 |
412 | Incorporating L2 Phonemes Using Articulatory Features for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The limited availability of non-native speech datasets presents a major challenge in automatic speech recognition (ASR) to narrow the performance gap between native and non-native speakers. To address this, the focus of this study is on the efficient incorporation of the L2 phonemes, which in this work refer to Korean phonemes, through articulatory feature analysis. |
Jisung Wang; Haram Lee; Myungwoo Oh; | arxiv-cs.CL | 2023-06-04 |
413 | SpellMapper: A Non-autoregressive Neural Spellchecker for ASR Customization with Candidate Retrieval Based on N-gram Mappings Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose: 1) a novel algorithm for candidate retrieval, based on misspelled n-gram mappings, which gives up to 90% recall with just the top 10 candidates on Spoken Wikipedia; 2) a non-autoregressive neural model based on BERT architecture, where the initial transcript and ten candidates are combined into one input. |
Alexandra Antonova; Evelina Bakhturina; Boris Ginsburg; | arxiv-cs.CL | 2023-06-04 |
414 | Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Combining several active learning paradigms and the core-set approach, we propose a new multi-rounds adaptation process that uses epistemic uncertainty to automate the annotation process, significantly reducing the associated costs and human labor. |
Bonaventure F. P. Dossou; | arxiv-cs.CL | 2023-06-03 |
415 | Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data. Imperfectly transcribed speech is a prevalent issue in … |
DONGJI GAO et. al. | ArXiv | 2023-06-01 |
416 | SlothSpeech: Denial-of-service Attack Against Speech Recognition Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we propose SlothSpeech, a denial-of-service attack against ASR models, which exploits the dynamic behaviour of the model. |
MIRAZUL HAQUE et. al. | arxiv-cs.SD | 2023-06-01 |
417 | Towards Hate Speech Detection in Low-resource Languages: Comparing ASR to Acoustic Word Embeddings on Wolof and Swahili Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We specifically use a multilingual AWE model trained on labelled data from well-resourced languages to spot keywords in data in the unseen target language. |
Christiaan Jacobs; Nathanaël Carraz Rakotonirina; Everlyn Asiko Chimoto; Bruce A. Bassett; Herman Kamper; | arxiv-cs.CL | 2023-06-01 |
418 | Inspecting Spoken Language Understanding from Kids for Basic Math Learning at Home Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work explores Spoken Language Understanding (SLU) pipeline within a task-oriented dialogue system developed for Kid Space, with cascading Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) components evaluated on our home deployment data with kids going through gamified math learning activities. |
Eda Okur; Roddy Fuentes Alba; Saurav Sahay; Lama Nachman; | arxiv-cs.CY | 2023-06-01 |
419 | Adaptation and Optimization of Automatic Speech Recognition (ASR) for The Maritime Domain in The Field of VHF Communication Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a multilingual automatic speech recognizer (ASR) for maritime radio communi-cation that automatically converts received VHF radio signals into text. |
Emin Cagatay Nakilcioglu; Maximilian Reimann; Ole John; | arxiv-cs.SD | 2023-06-01 |
420 | Strategies for Improving Low Resource Speech to Text Translation Relying on Pre-trained ASR Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST). |
Santosh Kesiraju; Marek Sarvas; Tomas Pavlicek; Cecile Macaire; Alejandro Ciuba; | arxiv-cs.CL | 2023-05-31 |
421 | The Tag-Team Approach: Leveraging CLS and Language Tagging for Enhancing Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, new approaches are explored and compared to improve the performance of CLS based multilingual ASR model. |
Kaousheik Jayakumar; Vrunda N. Sukhadia; A Arunkumar; S. Umesh; | arxiv-cs.CL | 2023-05-31 |
422 | Zero-Shot Automatic Pronunciation Assessment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT. |
Hongfu Liu; Mingqian Shi; Ye Wang; | arxiv-cs.SD | 2023-05-31 |
423 | Accurate and Structured Pruning for Efficient Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance. |
HUIQIANG JIANG et. al. | arxiv-cs.CL | 2023-05-31 |
424 | STT4SG-350: A Speech Corpus for All Swiss German Dialect Regions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present STT4SG-350 (Speech-to-Text for Swiss German), a corpus of Swiss German speech, annotated with Standard German text at the sentence level. |
MICHEL PLÜSS et. al. | arxiv-cs.CL | 2023-05-30 |
425 | Towards Selection of Text-to-speech Data to Augment ASR Training Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: This paper presents a method for selecting appropriate synthetic speech samples from a given large text-to-speech (TTS) dataset as supplementary training data for an automatic … |
SHUO LIU et. al. | ArXiv | 2023-05-30 |
426 | Improving Textless Spoken Language Understanding with Discrete Units As Intermediate Target Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, inspired by the content-disentangled discrete units from self-supervised speech models, we proposed to use discrete units as intermediate guidance to improve textless SLU performance. |
Guan-Wei Wu; Guan-Ting Lin; Shang-Wen Li; Hung-yi Lee; | arxiv-cs.CL | 2023-05-29 |
427 | CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a simple-to-follow recipe aligned to the SpeechBrain toolkit for accent classification based on Common Voice 7.0 (English) and Common Voice 11.0 (Italian, German, and Spanish). |
Juan Zuluaga-Gomez; Sara Ahmed; Danielius Visockas; Cem Subakan; | arxiv-cs.CL | 2023-05-29 |
428 | HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. |
Florian Mai; Juan Zuluaga-Gomez; Titouan Parcollet; Petr Motlicek; | arxiv-cs.CL | 2023-05-29 |
429 | Building Accurate Low Latency ASR for Streaming Voice Search Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we focus on developing accurate LSTM, attention, and CTC based streaming ASR models for large-scale Hinglish (a blend of Hindi and English) Voice Search. |
Abhinav Goyal; Nikesh Garera; | arxiv-cs.SD | 2023-05-29 |
430 | Exploration of Efficient End-to-End ASR Using Discretized Input from Self-Supervised Learning IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a new protocol that utilizes discretized token sequences in ASR tasks, which includes de-duplication and sub-word modeling to enhance the input sequence. |
Xuankai Chang; Brian Yan; Yuya Fujita; Takashi Maekaku; Shinji Watanabe; | arxiv-cs.SD | 2023-05-29 |
431 | Speech and Noise Dual-stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a dual-stream spectrogram refine network to simultaneously refine the speech and noise and decouple the noise from the noisy input. |
HAOYU LU et. al. | arxiv-cs.SD | 2023-05-28 |
432 | Retraining-free Customized ASR for Enharmonic Words Based on A Named-Entity-Aware Model and Phoneme Similarity Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since such NE words tend to be important keywords, ASR easily loses user trust if it misrecognizes them. To solve these problems, this paper proposes a novel retraining-free customized method for E2E-ASRs based on a named-entity-aware E2E-ASR model and phoneme similarity estimation. |
Yui Sudo; Kazuya Hata; Kazuhiro Nakadai; | arxiv-cs.SD | 2023-05-28 |
433 | Synthesizing Speech Test Cases with Text-to-Speech? An Empirical Study on The False Alarms in Automated Speech Recognition Testing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this study, we investigate false alarm occurrences in five popular ASR systems using synthetic audio generated from four TTS systems and human audio obtained from two commonly used datasets. |
JULIA KAIWEN LAU et. al. | arxiv-cs.SE | 2023-05-27 |
434 | DisfluencyFixer: A Tool to Enhance Language Learning Through Speech To Speech Disfluency Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents DisfluencyFixer, a tool that performs speech-to-speech disfluency correction in English and Hindi using a pipeline of Automatic Speech Recognition (ASR), Disfluency Correction (DC) and Text-To-Speech (TTS) models. |
Vineet Bhat; Preethi Jyothi; Pushpak Bhattacharyya; | arxiv-cs.CL | 2023-05-26 |
435 | INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced Non-Native Speech Recognition Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automatic Speech Recognition (ASR) systems have attained unprecedented performance with large speech models pre-trained based on self-supervised speech representation learning. … |
Eunseop Yoon; Hee Suk Yoon; John Harvill; M. Hasegawa-Johnson; C. Yoo; | Annual Meeting of the Association for Computational … | 2023-05-25 |
436 | Svarah: Evaluating English ASR Systems on Indian Accents Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Unfortunately, Indian speakers find a very poor representation in existing English ASR benchmarks such as LibriSpeech, Switchboard, Speech Accent Archive, etc. In this work, we address this gap by creating Svarah, a benchmark that contains 9.6 hours of transcribed English audio from 117 speakers across 65 geographic locations throughout India, resulting in a diverse range of accents. |
TAHIR JAVED et. al. | arxiv-cs.CL | 2023-05-25 |
437 | Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with A Sidecar Separator Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR) system into a multi-talker one, by inserting a Sidecar separator into the frozen well-trained ASR model. Extending on this, we incorporate a diarization branch into the Sidecar, allowing for unified modeling of both ASR and diarization with a negligible overhead of only 768 parameters. |
LINGWEI MENG et. al. | arxiv-cs.SD | 2023-05-25 |
438 | Iteratively Improving Speech Recognition and Voice Conversion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel iterative way of improving both the ASR and VC models. |
Mayank Kumar Singh; Naoya Takahashi; Onoe Naoyuki; | arxiv-cs.SD | 2023-05-24 |
439 | InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, these methods pay less attention to the interaction of local and global features, and their series architectures are rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer for interactive local and global features fusion to learn a better representation for ASR. |
ZHI-HAO LAI et. al. | arxiv-cs.CL | 2023-05-24 |
440 | SE-Bridge: Speech Enhancement with Consistent Brownian Bridge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose SE-Bridge, a novel method for speech enhancement (SE). |
Zhibin Qiu; Mengfan Fu; Fuchun Sun; Gulila Altenbek; Hao Huang; | arxiv-cs.SD | 2023-05-23 |
441 | Personalized Predictive ASR for Latency Reduction in Voice Assistants Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. |
Andreas Schwarz; Di He; Maarten Van Segbroeck; Mohammed Hethnawi; Ariya Rastrow; | arxiv-cs.CL | 2023-05-23 |
442 | Evaluating OpenAI’s Whisper ASR for Punctuation Prediction and Topic Modeling of Life Histories of The Museum of The Person Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This chapter presents the first study on the performance of Whisper for punctuation prediction in the Portuguese language. |
LUCAS RAFAEL STEFANEL GRIS et. al. | arxiv-cs.CL | 2023-05-23 |
443 | Text Generation with Speech Synthesis for ASR Data Augmentation Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Aiming at reducing the reliance on expensive human annotations, data synthesis for Automatic Speech Recognition (ASR) has remained an active area of research. While prior work … |
ZHUANGQUN HUANG et. al. | ArXiv | 2023-05-22 |
444 | Cross-lingual Knowledge Transfer and Iterative Pseudo-labeling for Low-Resource Speech Recognition with Transducers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our goal is to train an all-neural Transducer-based ASR system to replace a DNN-HMM hybrid system with no manually annotated training data. |
JAN SILOVSKY et. al. | arxiv-cs.CL | 2023-05-22 |
445 | GNCformer Enhanced Self-attention for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper,an Enhanced Self-Attention (ESA) mechanism has been put forward for robust feature extraction.The proposed ESA is integrated with the recursive gated convolution and self-attention mechanism.In particular, the former is used to capture multi-order feature interaction and the latter is for global feature extraction.In addition, the location of interest that is suitable for inserting the ESA is also worth being explored.In this paper, the ESA is embedded into the encoder layer of the Transformer network for automatic speech recognition (ASR) tasks, and this newly proposed model is named GNCformer. |
J. Li; Z. Duan; S. Li; X. Yu; G. Yang; | arxiv-cs.SD | 2023-05-22 |
446 | Self-supervised Representations in Speech-based Depression Detection IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes handling training data sparsity in speech-based automatic depression detection (SDD) using foundation models pre-trained with self-supervised learning (SSL). |
Wen Wu; Chao Zhang; Philip C. Woodland; | arxiv-cs.CL | 2023-05-20 |
447 | A Lexical-aware Non-autoregressive Transformer-based ASR Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: A series of experiments are conducted on the AISHELL-1, CSJ, and TEDLIUM 2 datasets. |
Chong-En Lin; Kuan-Yu Chen; | arxiv-cs.CL | 2023-05-18 |
448 | FunASR: A Fundamental End-to-End Speech Recognition Toolkit IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. |
ZHIFU GAO et. al. | arxiv-cs.SD | 2023-05-18 |
449 | A Comparative Study on E-Branchformer Vs Conformer in Speech Recognition, Translation, and Understanding Tasks IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models. |
YIFAN PENG et. al. | arxiv-cs.CL | 2023-05-18 |
450 | AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present AVFormer, a simple method for augmenting audioonly models with visual information, at the same time performing lightweight domain adaptation. |
Paul Hongsuck Seo; Arsha Nagrani; Cordelia Schmid; | cvpr | 2023-05-17 |
451 | MmMIC: Multi-modal Speech Recognition Based on MmWave Radar Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: With the proliferation of voice assistants, microphone-based speech recognition technology usually cannot achieve good performance in the situation of multiple sound sources and … |
LONG FAN et. al. | IEEE INFOCOM 2023 – IEEE Conference on Computer … | 2023-05-17 |
452 | OOD-Speech: A Large Bengali Speech Recognition Dataset for Out-of-Distribution Benchmarking Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: We present OOD-Speech, the first out-of-distribution (OOD) benchmarking dataset for Bengali automatic speech recognition (ASR). Being one of the most spoken languages globally, … |
FAZLE RAKIB et. al. | ArXiv | 2023-05-15 |
453 | Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. |
Weiwei Lin; Chenhang He; Man-Wai Mak; Youzhi Tu; | arxiv-cs.SD | 2023-05-14 |
454 | Investigating The Sensitivity of Automatic Speech Recognition Systems to Phonetic Variation in L2 Englishes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work demonstrates a method of probing an ASR system to discover how it handles phonetic variation across a number of L2 Englishes. |
Emma O’Neill; Julie Carson-Berndsen; | arxiv-cs.CL | 2023-05-12 |
455 | Multi-Temporal Lip-Audio Memory for Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement insufficient information of lip movements. |
Jeong Hun Yeo; Minsu Kim; Yong Man Ro; | arxiv-cs.CV | 2023-05-08 |
456 | Hybrid Transducer and Attention Based Encoder-Decoder Modeling for Speech-to-Text Tasks IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to leverage strengths of both modeling methods, we propose a solution by combining Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. |
YUN TANG et. al. | arxiv-cs.CL | 2023-05-04 |
457 | TrojanModel: A Practical Trojan Attack Against Automatic Speech Recognition Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: While deep learning techniques have achieved great success in modern digital products, researchers have shown that deep learning models are susceptible to Trojan attacks. In a … |
W. Zong; Yang-Wai Chow; Willy Susilo; Kien Do; S. Venkatesh; | 2023 IEEE Symposium on Security and Privacy (SP) | 2023-05-01 |
458 | Automatic Speech Recognition of Portuguese Phonemes Using Neural Networks Ensemble Related Papers Related Patents Related Grants Related Venues Related Experts View |
N. Nedjah; Alejandra D. Bonilla; Luiza de Macedo Mourelle; | Expert Syst. Appl. | 2023-05-01 |
459 | Edge Computing Solutions Supporting Voice Recognition Services for Speakers with Dysarthria Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In the framework of Automatic Speech Recognition (ASR), the synergism between edge computing and artificial intelligence has led to the development of intelligent objects that … |
Davide Mulfari; Lorenzo Carnevale; A. Galletta; M. Villari; | 2023 IEEE/ACM 23rd International Symposium on Cluster, … | 2023-05-01 |
460 | Building A Non-native Speech Corpus Featuring Chinese-English Bilingual Children: Compilation and Rationale Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper introduces a non-native speech corpus consisting of narratives from fifty 5- to 6-year-old Chinese-English children. |
Hiuchung Hung; Andreas Maier; Thorsten Piske; | arxiv-cs.CL | 2023-04-30 |
461 | Enhancing Multilingual Speech Recognition in Air Traffic Control By Sentence-level Language Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we present a two-stage multilingual ASR framework. |
Peng Fan; Dongyue Guo; JianWei Zhang; Bo Yang; Yi Lin; | arxiv-cs.SD | 2023-04-29 |
462 | Hierarchical Softmax for End-To-End Low-Resource Multilingual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose an approach that leverages neighboring languages to improve low-resource scenario performance, founded on the hypothesis that similar linguistic units in neighboring languages exhibit comparable term frequency distributions, which enables us to construct a Huffman tree for performing multilingual hierarchical Softmax decoding. |
Q. LIU et. al. | icassp | 2023-04-27 |
463 | Effective Training of RNN Transducer Models on Diverse Sources of Speech and Text Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a novel modeling framework for effective training of end-to-end automatic speech recognition (ASR) models on various sources of data from diverse domains: speech paired with clean ground truth transcripts, speech with noisy pseudo transcripts from semi-supervised decodes and unpaired text-only data. |
T. Fukuda; S. Thomas; | icassp | 2023-04-27 |
464 | Speech Summarization of Long Spoken Document: Improving Memory Efficiency of Speech/Text Encoders Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a speech summarization system that enables E2E summarization from 100 seconds, which is the limit of the conventional method, to up to 10 minutes (i.e., the duration of typical instructional videos on YouTube). |
T. KANO et. al. | icassp | 2023-04-27 |
465 | DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, for generative tasks such as speech enhancement and speech separation, most self-supervised speech representations did not show substantial improvements. To deal with this problem, in this paper, we propose data2vec-SG (Speech Generation), which is a teacher-student learning framework that addresses speech generation tasks. |
H. WANG et. al. | icassp | 2023-04-27 |
466 | Stabilising and Accelerating Light Gated Recurrent Units for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, the unbounded nature of its rectified linear unit on the candidate recurrent gate induces a gradient exploding phenomenon disrupting the training process and preventing it from being applied to medium to large ASR datasets. In this paper, we theoretically and empirically derive the necessary conditions for its stability as well as engineering mechanisms to speed up by a factor of five its training time, hence introducing a novel version of this architecture named SLi-GRU. |
A. Moumen; T. Parcollet; | icassp | 2023-04-27 |
467 | Exploring Self-Supervised Pre-Trained ASR Models for Dysarthric and Elderly Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper explores a series of approaches to integrate domain adapted Self-Supervised Learning (SSL) pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted wav2vec2.0 speech representations; b) frame-level joint decoding of TDNN systems separately trained using standard acoustic features alone and with additional wav2vec2.0 features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain adapted wav2vec2.0 models. |
S. HU et. al. | icassp | 2023-04-27 |
468 | Importance of Different Temporal Modulations of Speech: A Tale of Two Perspectives Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. |
S. Sadhu; H. Hermansky; | icassp | 2023-04-27 |
469 | A Sidecar Separator Can Convert A Single-Talker Speech Recognition System to A Multi-Talker One IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Although automatic speech recognition (ASR) can perform well in common non-overlapping environments, sustaining performance in multi-talker overlapping speech recognition remains … |
L. MENG et. al. | icassp | 2023-04-27 |
470 | Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. |
P. MA et. al. | icassp | 2023-04-27 |
471 | Reducing Language Confusion for Code-Switching Speech Recognition with Token-Level Language Diarization IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We address the problem of language confusion for improving CS-ASR from two perspectives: incorporating and disentangling language information. |
H. LIU et. al. | icassp | 2023-04-27 |
472 | Improving Speech-to-Speech Translation Through Unlabeled Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an effective way to utilize the massive existing unlabeled text from different languages to create a large amount of S2ST data to improve S2ST performance by applying various acoustic effects to the generated synthetic data. |
X. -P. NGUYEN et. al. | icassp | 2023-04-27 |
473 | Domain Adaptation with External Off-Policy Acoustic Catalogs for Scalable Contextual End-to-End Automated Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we investigate the potential of leveraging external knowledge, particularly through off-policy generated text-to-speech key-value stores, to allow for flexible post-training adaptation to new data distributions. |
D. M. Chan; S. Ghosh; A. Rastrow; B. Hoffmeister; | icassp | 2023-04-27 |
474 | The Edinburgh International Accents of English Corpus: Towards The Democratization of English ASR IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). |
R. SANABRIA et. al. | icassp | 2023-04-27 |
475 | Fine-Grained Textual Knowledge Transfer to Improve RNN Transducers for Speech Recognition and Understanding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since these are E2E models operating on speech directly, there remains a potential to improve their performance using purely text based models like BERT, which have strong language understanding capabilities. In this paper, we propose a new training criteria for RNN-T based E2E ASR and SLU to transfer BERT’s knowledge into these systems. |
V. Sunder; S. Thomas; H. -K. J. Kuo; B. Kingsbury; E. Fosler-Lussier; | icassp | 2023-04-27 |
476 | Robust Audio-Visual ASR with Unified Cross-Modal Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a new audio-visual speech recognition model with a unified cross-modal attention mechanism. |
J. Li; C. Li; Y. Wu; Y. Qian; | icassp | 2023-04-27 |
477 | Do Coarser Units Benefit Cluster Prediction-Based Speech Pre-Training? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The research community has produced many successful self-supervised speech representation learning methods over the past few years. |
A. ELKAHKY et. al. | icassp | 2023-04-27 |
478 | Multi-Resolution Location-Based Training for Multi-Channel Continuous Speech Separation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce multi-resolution LBT to estimate the complex spectrograms from low to high time and frequency resolutions. |
H. Taherian; D. Wang; | icassp | 2023-04-27 |
479 | Transcription Free Filler Word Detection with Neural Semi-CRFs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work, we investigate a filler word detection system that does not depend on ASR systems. |
G. Zhu; Y. Yan; J. -P. Caceres; Z. Duan; | icassp | 2023-04-27 |
480 | An ASR-Free Fluency Scoring Approach with Self-Supervised Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes a novel ASR-free approach for automatic fluency assessment using self-supervised learning (SSL). |
W. LIU et. al. | icassp | 2023-04-27 |
481 | Noise-Aware Target Extension with Self-Distillation for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a noise-aware target extension (NATE) that extends the senone target to contain noise awareness by jointly classifying the senone and noise in a single branch. |
J. -S. Seong; J. -H. Choi; J. Kyung; Y. -R. Jeoung; J. -H. Chang; | icassp | 2023-04-27 |
482 | Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: When training data is lacking in ASR, a large-scale pre-training and fine-tuning framework is often sufficient to achieve high recognition rates; however, in electrolaryngeal speech, the domain shift between the pre-training and fine-tuning data is too large to overcome, limiting the maximum improvement of recognition rates. To resolve this, we propose an intermediate fine-tuning step that uses imperfect synthetic speech to close the domain shift gap between the pre-training and target data. |
L. P. Violeta; D. Ma; W. -C. Huang; T. Toda; | icassp | 2023-04-27 |
483 | Using Adapters to Overcome Catastrophic Forgetting in End-to-End Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we aim to overcome CF for E2E ASR by inserting adapters, small architectures with few parameters that allow a general model to be fine-tuned to a specific task, into our model. |
S. V. Eeckt; H. Van Hamme; | icassp | 2023-04-27 |
484 | Text Is All You Need: Personalizing ASR Models Using Controllable Speech Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. |
K. Yang; T. -Y. Hu; J. -H. R. Chang; H. Swetha Koppula; O. Tuzel; | icassp | 2023-04-27 |
485 | Selective FiLM Conditioning with CTC-Based ASR Probability for Speech Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Although prior studies have improved the performance, they are inefficient because the two networks are combined and require large model sizes. To address this limitation, we propose an efficient way to use feature-wise linear modulation (FiLM) conditioning with CTC-based ASR probabilities for the SE system. |
D. -H. Yang; J. -H. Chang; | icassp | 2023-04-27 |
486 | Domain and Language Adaptation Using Heterogeneous Datasets for Wav2vec2.0-Based Speech Recognition of Low-Resource Language Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address the effective fine-tuning of a large-scale pretrained model for automatic speech recognition (ASR) of low-resource languages with only a one-hour matched dataset. |
K. Soky; S. Li; C. Chu; T. Kawahara; | icassp | 2023-04-27 |
487 | Speech Reconstruction from Silent Tongue and Lip Articulation By Pseudo Target Generation and Domain Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose to employ a method built on pseudo target generation and domain adversarial training with an iterative training strategy to improve the intelligibility and naturalness of the speech recovered from silent tongue and lip articulation. |
R. -C. Zheng; Y. Ai; Z. -H. Ling; | icassp | 2023-04-27 |
488 | UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose UCorrect, an unsupervised Detector-Generator-Selector framework for ASR Error Correction. |
J. GUO et. al. | icassp | 2023-04-27 |
489 | Leveraging Large Text Corpora For End-To-End Speech Summarization IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present two novel methods that leverage a large amount of external text summarization data for E2E SSum training. |
K. MATSUURA et. al. | icassp | 2023-04-27 |
490 | Automatic Severity Classification of Dysarthric Speech By Using Self-Supervised Model with Multi-Task Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To tackle the problem, we propose a novel automatic severity assessment method for dysarthric speech, using the self-supervised model in conjunction with multi-task learning. |
E. J. Yeo; K. Choi; S. Kim; M. Chung; | icassp | 2023-04-27 |
491 | Weavspeech: Data Augmentation Strategy For Automatic Speech Recognition Via Semantic-Aware Weaving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Furthermore, if speech signals are indiscriminately mixed without considering semantics, the risk of generating nonsensical sentences arises. To address these issues, in this paper, we propose WeavSpeech, a simple yet effective cut-and-paste augmentation method for ASR tasks that weaves a pair of speech data considering semantics. |
K. Seo; J. Park; J. Song; E. Yang; | icassp | 2023-04-27 |
492 | The NPU-ASLP System for Audio-Visual Speech Recognition in MISP 2022 Challenge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper describes our NPU-ASLP system for the Audio-Visual Diarization and Recognition (AVDR) task in the Multi-modal Information based Speech Processing (MISP) 2022 Challenge. |
P. Guo; H. Wang; B. Mu; A. Zhang; P. Chen; | icassp | 2023-04-27 |
493 | SAN: A Robust End-to-End ASR Model Architecture Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Siamese Adversarial Network (SAN) architecture for automatic speech recognition, which aims at solving the difficulty of fuzzy audio recognition. |
Z. Min; Q. Ge; G. Huang; | icassp | 2023-04-27 |
494 | An Adapter Based Multi-Label Pre-Training for Speech Separation and Enhancement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Based on HuBERT, this work investigates improving the SSL model for SS and SE. |
T. Wang; X. Chen; Z. Chen; S. Yu; W. Zhu; | icassp | 2023-04-27 |
495 | Weight Averaging: A Simple Yet Effective Method to Overcome Catastrophic Forgetting in Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Focusing on End-to-End ASR, in this paper, we propose a simple yet effective method to overcome catastrophic forgetting: weight averaging. |
S. Vander Eeckt; H. Van Hamme; | icassp | 2023-04-27 |
496 | Ensemble Knowledge Distillation of Self-Supervised Speech Models IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: On top of that, we proposed a multiple prediction head method for student models to predict different layer outputs of multiple teacher models simultaneously. |
K. -P. HUANG et. al. | icassp | 2023-04-27 |
497 | Federated Self-Learning with Weak Supervision for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We study the problem of federated continual incremental learning for recurrent neural network-transducer (RNN-T) ASR models in the privacy-enhancing scheme of learning on-device, without access to ground truth human transcripts or machine transcriptions from a stronger ASR model. |
M. RAO et. al. | icassp | 2023-04-27 |
498 | Resource-Efficient Transfer Learning from Speech Foundation Model Using Hierarchical Feature Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we analyze the performance of features at different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. |
Z. HUO et. al. | icassp | 2023-04-27 |
499 | Slot-Triggered Contextual Biasing For Personalized Speech Recognition Using Neural Transducers IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method whereby the E2E ASR model is trained to emit opening and closing tags around slot content which are used to both selectively enable biasing and decide which catalog to use for biasing. |
S. Tong; P. Harding; S. Wiesler; | icassp | 2023-04-27 |
500 | Avoid Overthinking in Self-Supervised Models for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We then motivate further research in EE by computing an optimal bound for performance versus speed trade-offs. To approach this bound we propose two new strategies for ASR: (1) we adapt the recently proposed patience strategy to ASR; and (2) we design a new EE strategy specific to ASR that performs better than all strategies previously introduced. |
D. Berrebbi; B. Yan; S. Watanabe; | icassp | 2023-04-27 |
501 | Robust Multi-modal Speech Emotion Recognition with ASR Error Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes an SER method robust to ASR errors. |
B. Lin; L. Wang; | icassp | 2023-04-27 |
502 | Towards Improved Room Impulse Response Estimation for Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a novel approach for blind room impulse response (RIR) estimation systems in the context of a downstream application scenario, far-field automatic speech recognition (ASR). |
A. RATNARAJAH et. al. | icassp | 2023-04-27 |
503 | Towards Accurate and Real-Time End-of-Speech Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a variant of the endpoint (EP) detection problem in automatic speech recognition (ASR), which we call the end-of-speech (EOS) estimation. |
Y. FAN et. al. | icassp | 2023-04-27 |
504 | Conversation-Oriented ASR with Multi-Look-Ahead CBS Architecture Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In streaming ASR, high accuracy is assured by attending to look-ahead frames, which leads to delay increments. To tackle this trade-off issue, we propose a multiple latency streaming ASR to achieve high accuracy with zero look-ahead. |
H. ZHAO et. al. | icassp | 2023-04-27 |
505 | Self-Supervised Learning-Based Source Separation for Meeting Data Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, seven SSL models were compared on both simulated and real-world corpora. |
Y. Li; X. Zheng; P. C. Woodland; | icassp | 2023-04-27 |
506 | Improving Noisy Student Training on Non-Target Domain Data for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a data selection strategy named LM Filter to improve the performance of NST on non-target domain data in ASR tasks. |
Y. Chen; W. Ding; J. Lai; | icassp | 2023-04-27 |
507 | Representation of Vocal Tract Length Transformation Based on Group Theory Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we focus on the property of vocal tract length transformation (VTLT) that forms a group, and derive the novel speech representation VTL spectrum based on group theory analysis, where only the phase of the VTL spectrum is changed by VTLT, which is a simple linear shift. |
A. Miyashita; T. Toda; | icassp | 2023-04-27 |
508 | Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents novel variational auto-encoder generative adversarial network (VAE-GAN) based personalized disordered speech augmentation approaches that simultaneously learn to encode, generate and discriminate synthesized impaired speech. |
Z. JIN et. al. | icassp | 2023-04-27 |
509 | Database-Aware ASR Error Correction for Speech-to-SQL Parsing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose an ASR correction method, DBATI (DataBase-Aware TaggerILM). |
Y. Shao; A. Kumar; N. Nakashole; | icassp | 2023-04-27 |
510 | Conformer-Based Target-Speaker Automatic Speech Recognition For Single-Channel Audio Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose CONF-TSASR, a non-autoregressive end-to-end time-frequency domain architecture for single-channel target-speaker automatic speech recognition (TS-ASR). |
Y. Zhang; K. C. Puvvada; V. Lavrukhin; B. Ginsburg; | icassp | 2023-04-27 |
511 | Multi-Temporal Lip-Audio Memory for Visual Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement insufficient information of lip movements. |
J. H. Yeo; M. Kim; Y. M. Ro; | icassp | 2023-04-27 |
512 | Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a simple yet effective approach called gradient remedy (GR) to solve interference between task gradients in noise-robust speech recognition, from perspectives of both angle and magnitude. |
Y. Hu; C. Chen; R. Li; Q. Zhu; E. S. Chng; | icassp | 2023-04-27 |
513 | Multi-Blank Transducers for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. |
H. Xu; F. Jia; S. Majumdar; S. Watanabe; B. Ginsburg; | icassp | 2023-04-27 |
514 | DeHuBERT: Disentangling Noise in A Self-Supervised Model for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a novel training framework, called deHuBERT, for noise reduction encoding inspired by H. Barlow’s redundancy-reduction principle. |
D. NG et. al. | icassp | 2023-04-27 |
515 | WL-MSR: Watch and Listen for Multimodal Subtitle Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel Watch and Listen for Multimodal Subtitle Recognition (WL-MSR) framework to obtain comprehensive video subtitles, by fusing the information provided by Optical Character Recognition (OCR) and Automatic Speech Recognition (ASR) models. |
J. Liu; H. Wang; W. Wang; X. He; J. Liu; | icassp | 2023-04-27 |
516 | Understanding Shared Speech-Text Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. |
G. WANG et. al. | icassp | 2023-04-27 |
517 | Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to factorize out the language component in the AED model, we propose the factorized attention-based encoder-decoder (Factorized AED) model whose decoder takes as input the posterior probabilities of a jointly trained LM. |
X. Gong; W. Wang; H. Shao; X. Chen; Y. Qian; | icassp | 2023-04-27 |
518 | An Analysis of Degenerating Speech Due to Progressive Dysarthria on ASR Performance Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The aims of this study were to (1) analyze the change of performance of ASR over time in individuals with degrading speech, and (2) explore mitigation strategies to optimize recognition throughout disease progression. |
K. TOMANEK et. al. | icassp | 2023-04-27 |
519 | Understanding Shared Speech-Text Representations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we expand our understanding of the resulting shared speech-text representations with two types of analyses. |
GARY WANG et. al. | arxiv-cs.CL | 2023-04-27 |
520 | Align, Write, Re-Order: Explainable End-to-End Speech Translation Via Operation Sequence Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The black-box nature of end-to-end speech-to-text translation (E2E ST) makes it difficult to understand how source language inputs are being mapped to the target language. To solve this problem, we propose to simultaneously generate automatic speech recognition (ASR) and ST predictions such that each source language word is explicitly mapped to a target language word. |
M. Omachi; B. Yan; S. Dalmia; Y. Fujita; S. Watanabe; | icassp | 2023-04-27 |
521 | Code-Switching Text Generation and Injection in Mandarin-English ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we investigate text generation and injection for improving the performance of a streaming model commonly used in industry, Transformer-Transducer (T-T), in Mandarin-English code-switching speech recognition. |
H. YU et. al. | icassp | 2023-04-27 |
522 | Improving Non-Autoregressive Speech Recognition with Autoregressive Pretraining Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose AR pretraining to the NAR encoder to reduce the accuracy gap between AR and NAR models. |
Y. Li; L. Samarakoon; I. Fung; | icassp | 2023-04-27 |
523 | Improving Fairness and Robustness in End-to-End Speech Recognition Through Unsupervised Clustering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a privacy preserving approach to improve fairness and robustness of end-to-end ASR without using metadata, zip codes, or even speaker or utterance embeddings directly in training. |
I. -E. Veliche; P. Fung; | icassp | 2023-04-27 |
524 | Anchored Speech Recognition with Neural Transducers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. |
D. RAJ et. al. | icassp | 2023-04-27 |
525 | Visual Information Matters for ASR Error Correction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The other is that the community lacks a high-quality benchmark where visual information matters for the EC models. Therefore, this paper provides 1) simple yet effective methods, namely gated fusion and image captions as prompts to incorporate visual information to help EC; 2) large-scale benchmark datasets, namely Visual-ASR-EC, where each item in the training data consists of visual, speech, and text information, and the test data are carefully selected by human annotators to ensure that even humans could make mistakes when visual information is missing. |
V. B. Kumar; S. Cheng; N. Peng; Y. Zhang; | icassp | 2023-04-27 |
526 | Self-Supervised Representations in Speech-Based Depression Detection IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes handling training data sparsity in speech-based automatic depression detection (SDD) using foundation models pre-trained with self-supervised learning (SSL). |
W. Wu; C. Zhang; P. C. Woodland; | icassp | 2023-04-27 |
527 | Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. |
L. -W. Chen; A. Rudnicky; | icassp | 2023-04-27 |
528 | LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder RoBERTa to further boost the performance. |
X. GONG et. al. | icassp | 2023-04-27 |
529 | Self-Adaptive Incremental Machine Speech Chain for Lombard TTS with High-Granularity ASR Feedback in Dynamic Noise Condition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we improve the self-adaptive TTS using character-vocabulary level ASR feedback at higher granularity, considering the losses in the positive and negative classes. |
S. Novitasari; S. Sakti; S. Nakamura; | icassp | 2023-04-27 |
530 | Wav2Seq: Pre-Training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. |
F. WU et. al. | icassp | 2023-04-27 |
531 | Structured State Space Decoder for Speech Recognition and Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study, we applied S4 as a decoder for ASR and text-to-speech (TTS) tasks, respectively, by comparing it with the Transformer decoder. |
K. Miyazaki; M. Murata; T. Koriyama; | icassp | 2023-04-27 |
532 | Self-Supervised Learning with Bi-Label Masked Speech Prediction for Streaming Multi-Talker Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate SSL for streaming multi-talker speech recognition, which generates transcriptions of overlapping speakers in a streaming fashion. |
Z. HUANG et. al. | icassp | 2023-04-27 |
533 | Towards Zero-Shot Code-Switched Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting where no transcribed CS speech data is available for training. |
B. Yan; M. Wiesner; O. Klejch; P. Jyothi; S. Watanabe; | icassp | 2023-04-27 |
534 | Context-Aware Fine-Tuning of Self-Supervised Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we study the use of context, i.e., surrounding segments, during fine-tuning and propose a new approach called context-aware fine-tuning. |
S. SHON et. al. | icassp | 2023-04-27 |
535 | Enhancing Unsupervised Speech Recognition with Diffusion GANS Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We enhance the vanilla adversarial training method for unsupervised Automatic Speech Recognition (ASR) with a diffusion GAN. |
X. Wu; | icassp | 2023-04-27 |
536 | VarArray Meets t-SOT: Advancing The State of The Art of Streaming Distant Conversational Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. |
N. KANDA et. al. | icassp | 2023-04-27 |
537 | MADI: Inter-Domain Matching and Intra-Domain Discrimination for Cross-Domain Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a novel UDA approach for ASR via inter-domain MAtching and intra-domain DIscrimination (MADI), which improves the model transferability by fine-grained inter-domain matching and discriminability by intra-domain contrastive discrimination simultaneously. |
J. Zhou; S. Zhao; N. Jiang; G. Zhao; Y. Qin; | icassp | 2023-04-27 |
538 | Iterative Shallow Fusion of Backward Language Model for End-To-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). |
A. Ogawa; T. Moriya; N. Kamo; N. Tawara; M. Delcroix; | icassp | 2023-04-27 |
539 | UFO2: A Unified Pre-Training Framework for Online and Offline Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose a Unified pre-training Framework for Online and Offline (UFO2) Automatic Speech Recognition (ASR), which 1) simplifies the two separate training workflows for online and offline modes into one process, and 2) improves the Word Error Rate (WER) performance with limited utterance annotation. |
L. FU et. al. | icassp | 2023-04-27 |
540 | Improving Accented Speech Recognition with Multi-Domain Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we use speech audio representing four different French accents to create fine-tuning datasets that improve the robustness of pre-trained ASR models. |
L. Maison; Y. Esteve; | icassp | 2023-04-27 |
541 | MoLE : Mixture Of Language Experts For Multi-Lingual Automatic Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we present a multi-lingual speech recognition network named Mixture-of-Language-Experts (MoLE), which digests speech in a variety of languages. |
Y. Kwon; S. -W. Chung; | icassp | 2023-04-27 |
542 | Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our work investigates different unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget. |
R. Gody; D. Harwath; | icassp | 2023-04-27 |
543 | A Speech Representation Anonymization Framework Via Selective Noise Perturbation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a speech anonymization framework that achieves privacy via noise perturbation to a selected subset of the high-utility representations extracted using a pre-trained speech encoder. |
M. Tran; M. Soleymani; | icassp | 2023-04-27 |
544 | From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition, which can re-purpose well-trained English automatic speech recognition (ASR) models to recognize the other languages. |
C. -H. H. YANG et. al. | icassp | 2023-04-27 |
545 | Simulating Realistic Speech Overlaps Improves Multi-Talker ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose an improved technique to simulate multi-talker overlapping speech with realistic speech overlaps, where an arbitrary pattern of speech overlaps is represented by a sequence of discrete tokens. |
M. YANG et. al. | icassp | 2023-04-27 |
546 | Context-Aware End-to-end ASR Using Self-Attentive Embedding and Tensor Fusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work, we propose a context-aware end-to-end ASR model that injects the self-attentive context embedding into the decoder of the recurrent neural network transducer (RNN-T). |
S. -Y. Chang; C. Zhang; T. N. Sainath; B. Li; T. Strohman; | icassp | 2023-04-27 |
547 | Multi-modal ASR Error Correction with Joint ASR Error Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To include the audio information for better error correction, we propose a sequence-to-sequence multi-modal ASR error correction model. |
B. Lin; L. Wang; | icassp | 2023-04-27 |
548 | Cleanformer: A Multichannel Array Configuration-Invariant Neural Enhancement Frontend for ASR in Smart Speakers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This work introduces Cleanformer, a streaming multichannel neural enhancement frontend for automatic speech recognition (ASR). |
J. Caroselli; A. Narayanan; N. Howard; T. O’Malley; | icassp | 2023-04-27 |
549 | UML: A Universal Monolingual Output Layer For Multilingual ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For multilingual ASR, due to the differences in written scripts across languages, multilingual WPMs bring the challenges of having overly large output layers and scaling to more languages. In this work, we propose a universal monolingual output layer (UML) to address such problems. |
C. Zhang; B. Li; T. N. Sainath; T. Strohman; S. -Y. Chang; | icassp | 2023-04-27 |
550 | Bridging Speech and Textual Pre-Trained Models With Unsupervised ASR Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To be specific, we propose to use unsupervised automatic speech recognition (ASR) as a connector that bridges different modalities used in speech and textual pre-trained models. |
J. Shi et al. | icassp | 2023-04-27 |
551 | Investigation Into Phone-Based Subword Units for Multilingual End-to-End Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we investigate the use of phone-based sub-words, specifically Byte Pair Encoding (BPE), as modeling units for multilingual end-to-end speech recognition. |
S. Yusuyin; H. Huang; J. Liu; C. Liu; | icassp | 2023-04-27 |
552 | Cascading and Direct Approaches to Unsupervised Constituency Parsing on Spoken Sentences Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we present the first study on unsupervised spoken constituency parsing given unlabeled spoken sentences and unpaired textual data. |
Y. Tseng; C. -I. J. Lai; H. -Y. Lee; | icassp | 2023-04-27 |
553 | Pretraining Conformer with ASR for Speaker Verification IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes to pretrain Conformer on an automatic speech recognition (ASR) task for speaker verification. |
D. Cai; W. Wang; M. Li; R. Xia; C. Huang; | icassp | 2023-04-27 |
554 | Adaptive Multi-Corpora Language Model Training for Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we introduce a novel adaptive multi-corpora training algorithm that dynamically learns and adjusts the sampling probability of each corpus along the training process. |
Y. Ma; Z. Liu; X. Zhang; | icassp | 2023-04-27 |
555 | Learning ASR Pathways: A Sparse Multilingual ASR Model IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in multilingual ASR, language-agnostic pruning may lead to severe performance drops on some languages because language-agnostic pruning masks may not fit all languages and discard important language-specific parameters. In this work, we present ASR pathways, a sparse multilingual ASR model that activates language-specific sub-networks (pathways), such that the parameters for each language are learned explicitly. |
M. Yang et al. | icassp | 2023-04-27 |
556 | Multi-Speaker Data Augmentation for Improved End-to-end Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While E2E ASR models achieve state-of-the-art performance on recognition tasks that match well with such training data, they are observed to fail on test recordings that contain multiple speakers, contain significant channel or background noise, or span longer durations than training data utterances. To mitigate these issues, we propose an on-the-fly data augmentation strategy that transforms single-speaker training data into multiple-speaker data by appending together multiple single-speaker utterances. |
S. Thomas; H. -K. J. Kuo; G. Saon; B. Kingsbury; | icassp | 2023-04-27 |
557 | Speech and Noise Dual-Stream Spectrogram Refine Network With Speech Distortion Loss For Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose a dual-stream spectrogram refine network to simultaneously refine the speech and noise and decouple the noise from the noisy input. |
H. Lu et al. | icassp | 2023-04-27 |
558 | Speech-Text Based Multi-Modal Training with Bidirectional Attention for Improved Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper, we propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both the ASR encoder (bottom layers) and the text encoder with a multi-modal learning method. |
Y. Yang; H. Xu; H. Huang; E. S. Chng; S. Li; | icassp | 2023-04-27 |
559 | HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit Bert for Robust Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper, we propose HuBERT-AGG, a novel method that learns noise-invariant SSL representations for robust speech recognition by distilling aggregated layer-wise representations. |
W. Wang; Y. Qian; | icassp | 2023-04-27 |
560 | Continual Learning for On-Device Speech Recognition Using Disentangled Conformers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a novel compute-efficient continual learning algorithm called DisentangledCL. This algorithm produces ASR models consisting of a frozen ‘core’ network for general-purpose use and several tunable ‘augment’ networks for speaker-specific tuning. |
A. Diwan et al. | icassp | 2023-04-27 |
561 | Self-Convolution for Automatic Speech Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Accordingly, we take their complementary advantages and propose a new module, namely self-convolution, to compensate for their individual limitations. |
T. -H. Zhang et al. | icassp | 2023-04-27 |
562 | Joint Unsupervised and Supervised Learning for Context-Aware Language Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, we need additional text labels to train the model to recognize speech, and acquiring these text labels comes at a high cost. To overcome this problem, we propose context-aware language identification using a combination of unsupervised and supervised learning without any text labels. |
J. Park et al. | icassp | 2023-04-27 |
563 | Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-to-Speech IF:3 Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes Virtuoso, a massively multilingual speech–text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. |
T. Saeki et al. | icassp | 2023-04-27 |
564 | Using Automatic Speech Recognition to Measure The Intelligibility of Speech Synthesized From Brain Signals Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Brain-computer interfaces (BCIs) can potentially restore lost function in patients with neurological injury. A promising new application of BCI technology has focused on speech … |
Suvi Varshney; D. Farias; David M. Brandman; S. Stavisky; Lee M. Miller; | 2023 11th International IEEE/EMBS Conference on Neural … | 2023-04-24 |
565 | Situating Automatic Speech Recognition Development Within Communities of Under-heard Language Speakers Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: In this paper we develop approaches to automatic speech recognition (ASR) development that suit the needs and functions of under-heard language speakers. Our novel contribution to … |
Thomas Reitmaier et al. | Proceedings of the 2023 CHI Conference on Human Factors in … | 2023-04-19 |
566 | Collaboratively Mitigating Racial Disparities in Automated Speech Recognition and Language Technologies with African American English Speakers: Community-Collaborative and Equity-Centered Approaches Toward Designing Inclusive Natural Language Systems Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Automated speech recognition (ASR) systems that rely on natural language processing (NLP) techniques are becoming increasingly prevalent within people’s everyday lives. From … |
Jay L. Cunningham; | Extended Abstracts of the 2023 CHI Conference on Human … | 2023-04-19 |
567 | Improving Automatic Summarization for Browsing Longform Spoken Dialog Summary Related Papers Related Patents Related Grants Related Venues Related Experts View Abstract: Longform spoken dialog delivers rich streams of informative content through podcasts, interviews, debates, and meetings. While production of this medium has grown tremendously, … |
Daniel Li; Thomas Chen; Alec Zadikian; Albert Tung; Lydia B. Chilton; | Proceedings of the 2023 CHI Conference on Human Factors in … | 2023-04-19 |
568 | Speech Command Recognition Based on Convolutional Spiking Neural Networks Summary Related Papers |