Paper Title
A Clustering Framework for Lexical Normalization of Roman Urdu
Authors
Abstract
Roman Urdu is an informal form of Urdu written in Roman script and is widely used in South Asia for online textual content. It lacks a standard spelling and therefore poses several normalization challenges for automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora. The framework comprises a phonetic algorithm, UrduPhone; a string matching component; a feature-based similarity function; and a clustering algorithm, Lex-Var. UrduPhone encodes Roman Urdu strings into pronunciation-based representations. The string matching component handles the character-level variations that occur when Urdu is written in Roman script.
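The pipeline named in the abstract (phonetic encoding, string matching, a combined similarity score, clustering) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_phonetic_key` is a Soundex-style stand-in and not the UrduPhone mapping, the equal weighting in `similarity` is arbitrary, and the greedy `cluster_variants` routine is a simple substitute for Lex-Var, whose details the abstract does not give.

```python
from difflib import SequenceMatcher

# Hypothetical Soundex-style consonant groups; NOT the actual UrduPhone mapping.
_GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
_CODE = {ch: digit for digit, chars in _GROUPS.items() for ch in chars}


def toy_phonetic_key(word: str, length: int = 6) -> str:
    """Keep the first letter, encode later consonants as digits, skip vowels."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0]
    prev = _CODE.get(word[0], "")
    for ch in word[1:]:
        code = _CODE.get(ch, "")  # vowels and h, w, y have no code
        if code and code != prev:  # collapse adjacent letters with the same code
            key += code
        prev = code
    return key[:length].ljust(length, "0")


def similarity(a: str, b: str, phonetic_weight: float = 0.5) -> float:
    """Weighted mix of phonetic-key agreement and character-level similarity."""
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return phonetic_weight * phonetic + (1.0 - phonetic_weight) * string


def cluster_variants(words, threshold: float = 0.8):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []
    for word in words:
        for cluster in clusters:
            if similarity(word, cluster[0]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])  # no cluster matched, so start a new one
    return clusters
```

With these assumed settings, `cluster_variants(["zindagi", "zindagee", "kitab", "zindgi", "kitaab"])` places the three spellings of "zindagi" in one cluster and the two spellings of "kitab" in another. The threshold and weight are illustrative choices, not values from the paper.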