Large scale speech-to-text translation with out-of-domain corpora using better context-based models and domain adaptation
Authors:
- Marcin Junczys-Dowmunt
- Paweł Przybysz
- Arleta Staszuk
- Eun-Kyoung Kim
- Jae Won Lee
Abstract
In this paper, we describe the process of building a large-scale speech-to-text pipeline. Two target domains, daily conversations and travel-related conversations between two agents, are examined for the English-German language pair in both directions. The SMT component is built from out-of-domain but freely available bilingual and monolingual data. We use most of the known available resources to examine the effects of unrestricted data and large-scale models. A naive baseline delivers solid results in terms of MT-quality. Extending the baseline with context-based translation model features, such as operation sequence models, higher-order class-based language models, and additional web-scale word-based language models, leads to a system that significantly outperforms the baseline. Domain adaptation is performed by separately weighting the influence of the out-of-domain subcorpora. This is explored for both translation models and language models, yielding significant improvements in both cases. Automatic and manual evaluation results are provided for raw MT-quality and ASR+MT-quality.
- Record ID
- UAM94b11fcadab545afa88c6820183d6747
- Journal series
- Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, ISSN 2308-457X, e-ISSN 1990-9772
- Issue year
- 2015
- Vol
- 2015-January
- Pages
- 2272-2276
- Language
- (en) English
- Score (nominal)
- 0
- Score source
- journalList
- Score
- Publication indicators
- = 0
- Uniform Resource Identifier
- https://researchportal.amu.edu.pl/info/article/UAM94b11fcadab545afa88c6820183d6747/
- URN
- urn:amu-prod:UAM94b11fcadab545afa88c6820183d6747