Paper title
A Parallel Evaluation Data Set of Software Documentation with Document Structure Annotation
Paper authors
Paper abstract
This paper accompanies the software documentation data set for machine translation, a parallel evaluation data set derived from the SAP Help Portal, which we released to the machine translation community for research purposes. It makes it possible to tune and evaluate machine translation systems in the domain of corporate software documentation and widens the range of available evaluation scenarios. The data set comprises the language pairs English to Hindi, Indonesian, Malay and Thai, and thus also increases test coverage for low-resource language pairs. Unlike most evaluation data sets, which consist of plain parallel text, the segments in this data set come with additional metadata describing the structure of the document context. We provide insights into the origin and creation of the data set, its particularities and characteristics, and machine translation results.