Paper title
UzbekStemmer: Development of a Rule-Based Stemming Algorithm for Uzbek Language
Paper authors
Paper abstract
In this paper, we present a rule-based stemming algorithm for the Uzbek language. Uzbek is an agglutinative language: many word forms are built by attaching suffixes, and the inventory of suffixes is large, which makes it difficult to find the stem of a word. We propose a method that stems Uzbek words by affix stripping, without relying on any database of normal word forms of the Uzbek language. Word affixes are classified into fifteen classes, and a finite state machine (FSM) is designed for each class according to morphological rules. We created fifteen FSMs and linked them together to form the Basic FSM. A lexicon of affixes in XML format was created, and a stemming application for Uzbek words was developed based on the FSMs.
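The affix-stripping idea can be illustrated with a minimal sketch. It is not the authors' implementation: the three suffix classes, their contents, the `MIN_STEM_LEN` guard, and the names `strip_class` and `stem` are all assumptions made for illustration, whereas the paper uses fifteen classes loaded from an XML lexicon. Each class acts as one state in a simple chain. The word is processed from its end, so the classes are visited in reverse attachment order (case, then possessive, then plural).

```python
# Illustrative suffix classes only; NOT the paper's fifteen classes or its XML lexicon.
# Classes are listed in stripping order (from the end of the word inward).
SUFFIX_CLASSES = [
    ("case", ["ning", "dan", "ni", "ga", "da"]),
    ("possessive", ["imiz", "ingiz", "im", "ing", "i"]),
    ("plural", ["lar"]),
]

MIN_STEM_LEN = 3  # hypothetical guard against stripping a word down to nothing


def strip_class(word, suffixes):
    """Remove the longest suffix of one class, if any; return (word, suffix or None)."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)], suffix
    return word, None


def stem(word):
    """Run the word through each class in turn, like a chain of linked states."""
    word = word.lower()
    removed = []
    for class_name, suffixes in SUFFIX_CLASSES:
        word, suffix = strip_class(word, suffixes)
        if suffix is not None:
            removed.append((class_name, suffix))
    return word, removed


if __name__ == "__main__":
    # kitob + lar + imiz + dan ("from our books")
    print(stem("kitoblarimizdan"))
```

For `kitoblarimizdan` the sketch removes `dan`, `imiz`, and `lar` in that order and returns the stem `kitob`. Because there is no dictionary of valid stems, a chain like this can over-strip words whose stem happens to end in a suffix-like string, so the morphological rules encoded in the FSMs have to carry that burden.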