The paper on these technologies has been accepted to INTERSPEECH 2024, the world's leading conference on the science and technology of spoken language processing.
Automatic speech recognition (ASR) is a technology that recognizes and analyzes spoken words, voices, and conversations and converts them into text. It is widely used in business for tasks such as displaying subtitles during meetings, creating meeting minutes, and generating reports. Because ASR enables faster transcription and data entry than manual typing, it is an extremely effective tool for improving operational efficiency.
Training ASR models faces the following challenges:
Ricoh's training method for ASR addresses both training cost and robustness to audio quality, helping to achieve the following:
Ricoh's ASR training method features:
Ricoh's novel training method builds a pre-trained model for ASR using the two new technologies below.
The self-supervised learning method for ASR predicts labels (recognition targets) from the input speech without requiring a transcript.
The labels used for self-supervised learning are derived uniquely from the input speech by deterministic computation, without relying on the statistical distribution of the dataset. The derived labels are dominated by the characteristics of the phonemes (the smallest units of sound that distinguish meaning) in the input speech. As a result, simple end-to-end training (training the whole path from inputs to final outputs as a single unit) yields a pre-trained model that adapts well to speech recognition tasks.
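To make the idea of deterministic, dataset-independent labels concrete, the following is a minimal sketch and not Ricoh's actual label computation, which this article does not specify. It assumes a hypothetical rule: each frame's label is built from the sign bits of its own low-order cepstral coefficients, which roughly describe the spectral envelope. The function names `log_spectrum` and `deterministic_labels`, the frame sizes, and the choice of 8 coefficients are all illustrative assumptions.

```python
import numpy as np

def log_spectrum(signal, frame_len=256, hop=128):
    """Split a 1-D signal into Hann-windowed frames and return log-magnitude spectra."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def deterministic_labels(log_spec, n_coeffs=8):
    """Hypothetical rule: map each frame to an integer using only that frame.

    No clustering and no dataset statistics are involved, so the same frame
    always receives the same label regardless of what else is in the corpus.
    """
    n_bins = log_spec.shape[1]
    # Real cepstrum: inverse FFT of the log-magnitude spectrum.
    cep = np.fft.irfft(log_spec, n=2 * (n_bins - 1), axis=1)
    envelope = cep[:, 1 : n_coeffs + 1]        # skip c0 (overall energy)
    bits = (envelope > 0).astype(np.int64)     # fixed threshold at zero
    weights = 1 << np.arange(n_coeffs)         # 1, 2, 4, ...
    return bits @ weights                      # integer in [0, 2**n_coeffs)
```

Because each label depends only on its own frame, labeling a subset of frames gives the same result as labeling the full utterance and slicing afterwards, which is the property the text describes as "uniquely derived".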
Rather than augmenting data to make the training environment approximate a particular usage environment, this method augments the training data so that the model learns to focus only on the phonemes in speech, which are what speech recognition needs, in any acoustic environment. Specifically, Ricoh's data augmentation generates speech with a wide range of intelligibility during training, from the original clear speech to barely audible speech. The ASR model learns to output the same result for the same utterance at different levels of intelligibility. This data augmentation technology has successfully strengthened the ASR model's tolerance to varied acoustic environments.
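The paper title refers to controlling cepstrum truncation, so one way to picture this augmentation is as a lifter that keeps only the first few cepstral coefficients of each frame. The sketch below is an illustration under that assumption and is not the published procedure: `frame_log_spectrum`, `truncate_cepstrum`, `random_truncation`, and the `min_keep` parameter are hypothetical names, and the link between the cutoff and perceived intelligibility is assumed rather than taken from the article.

```python
import numpy as np

def frame_log_spectrum(signal, frame_len=256, hop=128):
    """Split a 1-D signal into Hann-windowed frames and return log-magnitude spectra."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

def truncate_cepstrum(log_spec, keep):
    """Zero every cepstral coefficient at quefrency index >= keep.

    keep == number of spectral bins leaves the spectrum unchanged;
    smaller values remove progressively more spectral detail.
    """
    n_bins = log_spec.shape[1]
    n_fft = 2 * (n_bins - 1)
    cep = np.fft.irfft(log_spec, n=n_fft, axis=1)   # real, symmetric cepstrum
    lifter = np.zeros(n_fft)
    lifter[:keep] = 1.0
    if keep > 1:
        lifter[-(keep - 1):] = 1.0                  # mirror half, keeps symmetry
    return np.fft.rfft(cep * lifter, axis=1).real

def random_truncation(log_spec, rng, min_keep=4):
    """Assumed augmentation step: pick a random cutoff per utterance."""
    n_bins = log_spec.shape[1]
    keep = int(rng.integers(min_keep, n_bins + 1))
    return truncate_cepstrum(log_spec, keep), keep
```

In a training loop, the original and a randomly truncated version of the same utterance would both be fed to the model, with a loss term that penalizes differences between the two outputs. That loss term is omitted here because the article does not describe it.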
For more information about these technologies, please see the paper "Self-Supervised Learning for ASR Pre-Training with Uniquely Determined Target Labels and Controlling Cepstrum Truncation for Speech Augmentation," accepted to INTERSPEECH 2024.
Ricoh aims to use digital services to realize a workplace design that supports people's creativity in their work. We have earmarked process automation as a growth area and help customers around the world make their operations more efficient and more advanced through a broad range of integrated solutions. We will continue to research and develop AI speech recognition as an AI technology that supports digital services and process automation.