Self-supervised learning and data augmentation technologies for automatic speech recognition (ASR)

Novel methods to enhance training efficiency and recognition accuracy of automatic speech recognition

The paper on these technologies has been accepted to INTERSPEECH 2024, the world's leading conference on the science and technology of spoken language processing.

Background

Automatic speech recognition (ASR) is a technology that recognizes and analyzes spoken words, voices, and conversations and converts them into text. It is widely used in business for tasks such as displaying subtitles during meetings, creating meeting minutes, and generating reports. ASR allows faster transcription and data entry than manual typing, making it an effective tool for improving operational efficiency.

Training of ASR faces the following challenges:

  • Traditionally, supervised learning for ASR requires speech data paired with corresponding transcripts to teach the artificial intelligence (AI) the relationship between speech and text, but transcription is costly. To mitigate this cost, recent research is exploring methods that first build a pre-trained model from a large volume of speech without transcripts and then fine-tune it with only a very small volume of transcribed speech data.
  • The quality of audio recorded in real-world environments varies with factors such as application and location, so ASR needs greater tolerance to acoustic noise to be usable across different settings.

Solutions

Ricoh's training method for ASR addresses both training cost and robustness to audio quality, helping to realize the following:

  • Robust recognition of casual conversations between people, as well as voices recorded at a distance from the microphone, even where there is noise or reverberation, making ASR easy to use.
  • Support for diverse work styles as a process automation tool for speech communication in workplaces shared by multiple people, such as automatic creation of meeting minutes and reports, display of subtitles during meetings, and spoken dialogue with AI agents.

Meeting image

Technical highlights

Ricoh's ASR training methods feature the following:

  • A novel self-supervised learning method (a training method that uses only speech without transcripts) simplifies the generation of labels for self-supervised learning, greatly reducing the technical difficulty for developers building their own pre-trained models.
  • A data augmentation technology that generates speech at various levels of intelligibility from the original speech broadly enhances tolerance to acoustic noise, regardless of environment or noise type.

Ricoh's novel training method builds a pre-trained model for ASR using the two new technologies below.

1. Self-supervised learning method

In this self-supervised learning method for ASR, the model is trained to predict labels (recognition targets) from the input speech, without using transcripts.

The labels used for self-supervised learning are derived uniquely from the input speech by deterministic computation, without exploiting the statistical distribution of the dataset. The derived labels are dominated by the characteristics of the phonemes (the smallest units of sound that distinguish meaning) in the input speech. As a result, simple end-to-end training (training as a single unit from the inputs to the final outputs) yields a pre-trained model that adapts well to speech recognition tasks.
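
To make the idea of deterministic, dataset-independent labels concrete, here is a minimal Python sketch. It is not Ricoh's published algorithm: the labeling rule (the sign pattern of a few low-order cepstral coefficients), the frame sizes, and the function names frame_signal, low_order_cepstrum, and deterministic_labels are all assumptions chosen for illustration. The point it shows is only that each frame's label is computed from that frame alone, so no statistics over the dataset are needed.

```python
import numpy as np

def frame_signal(wave, frame_len=400, hop=160):
    """Split a 1-D waveform into overlapping frames.
    400/160 samples correspond to 25 ms / 10 ms at 16 kHz (assumed values)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return wave[idx]

def low_order_cepstrum(frames, n_coef=4):
    """Real cepstrum per frame; low-order coefficients describe the
    spectral envelope, which carries much of the phonetic information."""
    windowed = frames * np.hanning(frames.shape[1])
    log_mag = np.log(np.abs(np.fft.rfft(windowed, axis=1)) + 1e-8)
    cep = np.fft.irfft(log_mag, axis=1)
    return cep[:, 1:n_coef + 1]  # skip c0, which mostly reflects energy

def deterministic_labels(wave, n_coef=4):
    """Hypothetical labeling rule: read the sign pattern of the low-order
    cepstral coefficients as a binary number in [0, 2**n_coef).
    Each label depends only on its own frame, not on other utterances."""
    cep = low_order_cepstrum(frame_signal(wave), n_coef)
    bits = (cep > 0).astype(np.int64)
    weights = 2 ** np.arange(n_coef)
    return bits @ weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wave = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz audio
    print(deterministic_labels(wave)[:10])
```

Because the rule is a fixed function of the input, running it twice on the same waveform always gives the same label sequence, and adding or removing other utterances from the training set changes nothing.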

2. Data augmentation method

Rather than augmenting data to make the training environment approximate one particular usage environment, this method augments the training data so that the model learns to focus only on the phonemes in speech, which are what speech recognition requires, in any acoustic environment. Specifically, during training, Ricoh's data augmentation generates speech covering a wide range of intelligibility, from the original clear speech to barely audible speech. The ASR model learns to output the same results for the same utterance at different levels of intelligibility. This data augmentation has strengthened the tolerance of ASR to varied acoustic environments.
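
The title of the INTERSPEECH 2024 paper refers to controlling cepstrum truncation, so the following Python sketch illustrates one way a truncation level could control how much spectral detail survives. The details are assumptions, not the paper's procedure: the sketch works on the log-magnitude spectrum of a single frame rather than resynthesizing a waveform, and the names truncate_cepstrum and views_with_varying_detail and the truncation levels are hypothetical.

```python
import numpy as np

def truncate_cepstrum(frame, keep):
    """Rebuild the log-magnitude spectrum of `frame` from only the first
    `keep` cepstral coefficients (plus their mirrored counterparts).
    Smaller `keep` means a smoother spectrum with less detail."""
    n = len(frame)
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-8)
    cep = np.fft.ifft(log_mag).real       # real cepstrum (even-symmetric)
    lifter = np.zeros(n)
    lifter[:keep] = 1.0
    if keep > 1:
        lifter[-(keep - 1):] = 1.0        # mirror side keeps the symmetry
    return np.fft.fft(cep * lifter).real

def views_with_varying_detail(frame, keeps=(201, 40, 10, 3)):
    """Return one spectral view per truncation level. In training, all
    views of the same utterance would share the same target output."""
    windowed = frame * np.hanning(len(frame))
    return [truncate_cepstrum(windowed, k) for k in keeps]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(400)      # stand-in for one 25 ms frame
    for k, view in zip((201, 40, 10, 3), views_with_varying_detail(frame)):
        roughness = np.sum((view - np.roll(view, 1)) ** 2)
        print(k, roughness)
```

For a 400-sample frame, keeping 201 coefficients reproduces the original log spectrum, while lower values progressively smooth it. Sampling the truncation level at random for each training example would give the model the same utterance at many levels of detail, matching the idea of training across a range of intelligibility.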

For more information about these technologies, please see "Self-Supervised Learning for ASR Pre-Training with Uniquely Determined Target Labels and Controlling Cepstrum Truncation for Speech Augmentation," accepted to INTERSPEECH 2024.

Ricoh's vision

Ricoh aims to use digital services to realize workplaces designed to support people's creativity in their work. We have identified process automation as a growth area and are helping customers around the world make their operations more efficient and advanced through broad, integrated solutions. We will continue to research and develop AI speech recognition as an AI technology that supports digital services and process automation.
