Towards End-To-End Automatic Speech Recognition
October 9th at 1:05 PM ET – 1:25 PM ET
Streaming automatic speech recognition (ASR) systems consist of a set of separate components, namely an acoustic model (AM), a pronunciation model (PM), a language model (LM), and an endpointer (EP). Traditionally, these components are trained independently on different datasets, with a number of independence assumptions made for tractability.
Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly in a single neural network. Given input acoustic frames, such a network directly outputs a probability distribution over graphemes or word hypotheses. These end-to-end models have been shown to surpass the performance of a conventional ASR system.
In this talk, we will present a number of recently introduced innovations that have significantly improved the performance of end-to-end models. We will also discuss some of the remaining shortcomings of these models and ongoing efforts to address them.
End-to-end ASR systems share similarities with generative AI in healthcare, where models streamline processes by generating outcomes directly from input data, improving efficiency and accuracy in both fields.

Bo Li
Senior Research Scientist at Google Inc.
Bo Li received a Ph.D. degree in Computer Science from the School of Computing, National University of Singapore, in 2014, and a B.E. degree in Computer Engineering from the School of Computer, Northwestern Polytechnical University, China, in 2008.
He is currently a Senior Research Scientist at Google. His research interests are mainly in acoustic modeling for robust automatic speech recognition, including deep learning adaptation methods and lifelong learning.