An improved Bayesian TRIE based model for SMS text normalization

Paper title

An improved Bayesian TRIE based model for SMS text normalization

Authors

Abhinava Sikdar, Niladri Chatterjee

Abstract

Normalization of SMS text, commonly known as texting language, has been pursued for more than a decade. A probabilistic approach based on the Trie data structure was proposed in the literature and was found to perform better than the earlier HMM-based approaches at predicting the correct alternative for an out-of-lexicon word. However, the success of the Trie-based approach depends largely on how correctly the underlying probabilities of word occurrences are estimated. In this work we propose a structural modification to the existing Trie-based model, along with a novel training algorithm and probability generation scheme. We prove two theorems on the statistical properties of the proposed Trie and use them to claim that it is an unbiased and consistent estimator of the occurrence probabilities of the words. We further fuse our model into the paradigm of noisy-channel-based error correction and provide a heuristic to go beyond a Damerau-Levenshtein distance of one. We also run simulations to support our claims and show the superiority of the proposed scheme over previous works.
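
To make the ingredients named in the abstract concrete, the following is a minimal sketch of a count-based Trie combined with noisy-channel candidate selection at a Damerau-Levenshtein distance of one. It is not the authors' structure, training algorithm, or probability generation scheme, none of which are described here. The names `CountTrie`, `edits1`, and `normalize` are hypothetical, the word probabilities are plain relative frequencies, and the error model is assumed uniform over all single edits, so the noisy-channel argmax reduces to picking the most frequent in-lexicon candidate.

```python
import string


class CountTrie:
    """Trie whose nodes store how often a word ended there (illustrative only)."""

    def __init__(self):
        self.children = {}
        self.count = 0  # number of training words ending at this node
        self.total = 0  # meaningful at the root only: total training words

    def add(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, CountTrie())
        node.count += 1
        self.total += 1

    def prob(self, word):
        """Relative-frequency estimate of P(word); 0.0 if the word is unseen."""
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0.0
        return node.count / self.total if self.total else 0.0


def edits1(word, alphabet=string.ascii_lowercase):
    """All strings within one deletion, transposition, substitution, or insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | transposes | replaces | inserts


def normalize(token, trie):
    """Return the token if known, else the most probable distance-one candidate."""
    if trie.prob(token) > 0:
        return token
    candidates = [w for w in edits1(token) if trie.prob(w) > 0]
    # Assumed uniform error model: rank candidates by the Trie prior alone,
    # breaking ties alphabetically so the result is deterministic.
    return max(candidates, key=lambda w: (trie.prob(w), w), default=token)


trie = CountTrie()
for w in ["tomorrow"] * 3 + ["see"] * 2 + ["you"]:
    trie.add(w)

print(normalize("tomorow", trie))
print(normalize("yuo", trie))
```

A token with no in-lexicon candidate at distance one is returned unchanged in this sketch; the heuristic the abstract mentions for going beyond distance one is not reproduced here.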
