Paper Title

Russian Natural Language Generation: Creation of a Language Modelling Dataset and Evaluation with Modern Neural Architectures

Authors

Zein Shaheen, Gerhard Wohlgenannt, Bassel Zaity, Dmitry Mouromtsev, Vadim Pak

Abstract

Generating coherent, grammatically correct, and meaningful text is very challenging; however, it is crucial to many modern NLP systems. So far, research has mostly focused on the English language; for other languages, both standardized datasets and experiments with state-of-the-art models are rare. In this work, we (i) provide a novel reference dataset for Russian language modeling and (ii) experiment with popular modern methods for text generation, namely variational autoencoders and generative adversarial networks, which we trained on the new dataset. We evaluate the generated text using metrics such as perplexity, grammatical correctness, and lexical diversity.
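As a rough illustration of two of the metrics named in the abstract, the sketch below computes perplexity from per-token log-probabilities and a distinct-n ratio as one possible proxy for lexical diversity. The abstract does not state how the authors define or compute these metrics, so the function names `perplexity` and `distinct_n`, the use of natural-log probabilities, and the choice of distinct-n are all assumptions made for illustration only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of a sequence, given natural-log probabilities per token.

    Assumption: the log-probabilities come from some language model;
    this is the standard exp(mean negative log-likelihood) definition.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def distinct_n(tokens, n):
    """Fraction of n-grams in `tokens` that are unique (distinct-n).

    Assumption: used here as a simple stand-in for lexical diversity;
    the paper may measure diversity differently.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical usage: a model that assigns probability 0.25 to every token
# has a perplexity of about 4 (up to floating-point error).
uniform_ppl = perplexity([math.log(0.25)] * 4)

# Hypothetical usage: a repetitive token sequence has low distinct-1.
repetitive_d1 = distinct_n(["a", "b", "a", "b"], 1)
```

Lower perplexity means the model finds the text less surprising, while a higher distinct-n ratio means the generated text repeats itself less.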
