An Efficient Architecture for Predicting the Case of Characters using Sequence Models

Paper title

An Efficient Architecture for Predicting the Case of Characters using Sequence Models

Authors

Ramena, Gopi; Nagaraju, Divija; Moharana, Sukumar; Mohanty, Debi Prasanna; Purre, Naresh

Abstract

The dearth of clean textual data often acts as a bottleneck in several natural language processing applications. The data available often lacks proper case (uppercase or lowercase) information. This often comes up when text is obtained from social media, messaging applications and other online platforms. This paper attempts to solve this problem by restoring the correct case of characters, commonly known as Truecasing. Doing so improves the accuracy of several processing tasks further down in the NLP pipeline. Our proposed architecture uses a combination of convolutional neural networks (CNN), bi-directional long short-term memory networks (LSTM) and conditional random fields (CRF), which work at a character level without any explicit feature engineering. In this study we compare our approach to previous statistical and deep learning based approaches. Our method shows an increment of 0.83 in F1 score over the current state of the art. Since truecasing acts as a preprocessing step in several applications, every increment in the F1 score leads to a significant improvement in the language processing tasks.
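To make the character-level framing concrete, the sketch below treats truecasing as sequence labeling: each character of the lowercased input receives a label saying whether it should be uppercase or lowercase. It is only an illustration of the task setup, not the authors' implementation. The function names and the two-label scheme ("U"/"L") are assumptions introduced here, and the code assumes text where lowercasing keeps one character per character (for example plain ASCII). In the architecture described above, a CNN, bi-directional LSTM and CRF model would be what predicts the labels; here they are taken from the reference text.

```python
from typing import List, Tuple


def to_char_labels(text: str) -> Tuple[str, List[str]]:
    """Turn cased text into (lowercased input, per-character case labels).

    Hypothetical label scheme: "U" = uppercase, "L" = anything else.
    Assumes lowercasing does not change the number of characters.
    """
    labels = ["U" if ch.isupper() else "L" for ch in text]
    lowered = "".join(ch.lower() for ch in text)
    return lowered, labels


def apply_char_labels(lowered: str, labels: List[str]) -> str:
    """Restore case by applying one label to each character."""
    if len(lowered) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(
        ch.upper() if label == "U" else ch
        for ch, label in zip(lowered, labels)
    )


# Training pairs would be built from cased text; at inference time a
# sequence model would predict `labels` from `lowered` alone.
lowered, labels = to_char_labels("Meet Gopi in Bangalore")
restored = apply_char_labels(lowered, labels)
```

Because the labels are assigned per character rather than per word, the same setup can in principle represent mixed-case tokens, which a word-level scheme would need extra classes to handle.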
