End-to-end (E2E) models, which directly predict output character sequences
given input speech, are good candidates for on-device speech recognition. E2E
models, however, present numerous challenges: In order to be truly useful, such
models must decode speech utterances in a streaming fashion, in real time; they
must be robust to the long tail of use cases; they must be able to leverage
user-specific context (e.g., contact lists); and above all, they must be
extremely accurate. In this work, we describe our efforts at building an E2E
speech recognizer using a recurrent neural network transducer. In experimental
evaluations, we find that the proposed approach can outperform a conventional
CTC-based model in terms of both latency and accuracy in a number of evaluation
categories.
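For background on the abstract above: the recurrent neural network transducer (RNN-T) it mentions is conventionally formulated as below. This is a minimal sketch of the standard transducer model of Graves (2012), not a restatement of this paper's exact architecture; the symbols x (acoustic frames), y (label sequence), and the encoder/prediction/joint decomposition are the customary ones, assumed here for illustration only.

  P(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x)
    % \mathcal{B} removes blank symbols, mapping an alignment a to the label sequence y
  h_t^{\mathrm{enc}} = \mathrm{Encoder}(x_{1:t})
    % causal acoustic encoder: sees only frames up to time t
  h_u^{\mathrm{pred}} = \mathrm{Prediction}(y_{1:u-1})
    % autoregressive predictor over previously emitted labels
  z_{t,u} = \mathrm{Joint}(h_t^{\mathrm{enc}}, h_u^{\mathrm{pred}})
  P(k \mid t, u) = \mathrm{softmax}(z_{t,u})_k

Because the encoder in this formulation is causal, the posterior over labels can be computed frame-synchronously as audio arrives, which is what allows a transducer model to decode in a streaming fashion, the property the abstract emphasizes.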
Description
[1811.06621] Streaming End-to-end Speech Recognition For Mobile Devices
%0 Generic
%1 he2018streaming
%A He, Yanzhang
%A Sainath, Tara N.
%A Prabhavalkar, Rohit
%A McGraw, Ian
%A Alvarez, Raziel
%A Zhao, Ding
%A Rybach, David
%A Kannan, Anjuli
%A Wu, Yonghui
%A Pang, Ruoming
%A Liang, Qiao
%A Bhatia, Deepti
%A Shangguan, Yuan
%A Li, Bo
%A Pundak, Golan
%A Sim, Khe Chai
%A Bagby, Tom
%A Chang, Shuo-yiin
%A Rao, Kanishka
%A Gruenstein, Alexander
%D 2018
%K 2018 arxiv deep-learning google lstm mobile speech
%T Streaming End-to-end Speech Recognition For Mobile Devices
%U http://arxiv.org/abs/1811.06621
%X End-to-end (E2E) models, which directly predict output character sequences
given input speech, are good candidates for on-device speech recognition. E2E
models, however, present numerous challenges: In order to be truly useful, such
models must decode speech utterances in a streaming fashion, in real time; they
must be robust to the long tail of use cases; they must be able to leverage
user-specific context (e.g., contact lists); and above all, they must be
extremely accurate. In this work, we describe our efforts at building an E2E
speech recognizer using a recurrent neural network transducer. In experimental
evaluations, we find that the proposed approach can outperform a conventional
CTC-based model in terms of both latency and accuracy in a number of evaluation
categories.
@misc{he2018streaming,
abstract = {End-to-end (E2E) models, which directly predict output character sequences
given input speech, are good candidates for on-device speech recognition. E2E
models, however, present numerous challenges: In order to be truly useful, such
models must decode speech utterances in a streaming fashion, in real time; they
must be robust to the long tail of use cases; they must be able to leverage
user-specific context (e.g., contact lists); and above all, they must be
extremely accurate. In this work, we describe our efforts at building an E2E
speech recognizer using a recurrent neural network transducer. In experimental
evaluations, we find that the proposed approach can outperform a conventional
CTC-based model in terms of both latency and accuracy in a number of evaluation
categories.},
added-at = {2019-12-27T12:29:18.000+0100},
author = {He, Yanzhang and Sainath, Tara N. and Prabhavalkar, Rohit and McGraw, Ian and Alvarez, Raziel and Zhao, Ding and Rybach, David and Kannan, Anjuli and Wu, Yonghui and Pang, Ruoming and Liang, Qiao and Bhatia, Deepti and Shangguan, Yuan and Li, Bo and Pundak, Golan and Sim, Khe Chai and Bagby, Tom and Chang, Shuo-yiin and Rao, Kanishka and Gruenstein, Alexander},
biburl = {https://www.bibsonomy.org/bibtex/2c2b9c8442c4ed00207335a30fc273d1e/analyst},
description = {[1811.06621] Streaming End-to-end Speech Recognition For Mobile Devices},
interhash = {2738d2492bd3b60af55863da1441c97f},
intrahash = {c2b9c8442c4ed00207335a30fc273d1e},
keywords = {2018 arxiv deep-learning google lstm mobile speech},
note = {cite arxiv:1811.06621},
timestamp = {2019-12-27T12:29:18.000+0100},
title = {Streaming End-to-end Speech Recognition For Mobile Devices},
url = {http://arxiv.org/abs/1811.06621},
year = 2018
}