A Zipf's Law-based Text Generation Approach for Addressing Imbalance in Entity Extraction

Paper Title

A Zipf's Law-based Text Generation Approach for Addressing Imbalance in Entity Extraction

Authors

Wang, Zhenhua; Ren, Ming; Gao, Dong; Li, Zhuang

Abstract

Entity extraction is critical to intelligent applications across diverse domains, but its effectiveness is limited by data imbalance. This paper approaches the problem through quantitative information: some entities are common while others are scarce, and this is reflected in the quantifiable distribution of words. Zipf's Law is well suited to describing that distribution. To move from words to entities, words within the documents are classified as common or rare. Sentences are then classified as common or rare and processed accordingly by text generation models. Rare entities in the generated sentences are labeled using human-designed rules and added to the raw dataset as a supplement, thereby mitigating the imbalance problem. The study presents a case of extracting entities from technical documents, and experimental results on two datasets support the effectiveness of the proposed method. The paper also discusses the significance of Zipf's Law in driving the progress of AI, broadening the reach and coverage of informetrics, and presents a demonstration of extending informetrics to interface with AI through Zipf's Law.
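
The word-to-sentence step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the boundary between common and rare words is chosen, so the cumulative-frequency cutoff (`common_mass`), the rule that a sentence is rare if it contains at least one rare word, and all function names below are assumptions made for illustration, not the authors' implementation. Text generation and rule-based labeling are not covered.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a sentence and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_words_by_rank(sentences, common_mass=0.8):
    """Partition the vocabulary into common and rare words.

    Words are ranked by frequency, as in a Zipf rank-frequency table.
    Assumption: the highest-ranked words that together cover roughly
    `common_mass` of all tokens are "common"; the rest are "rare".
    """
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(counts.values())
    common, rare = set(), set()
    covered = 0
    for word, freq in counts.most_common():
        # A word is common while the mass covered by higher-ranked
        # words is still below the cutoff.
        if covered < common_mass * total:
            common.add(word)
        else:
            rare.add(word)
        covered += freq
    return common, rare


def split_sentences(sentences, rare_words):
    """Assumption: a sentence is rare if it contains any rare word."""
    common_sents, rare_sents = [], []
    for s in sentences:
        if any(tok in rare_words for tok in tokenize(s)):
            rare_sents.append(s)
        else:
            common_sents.append(s)
    return common_sents, rare_sents
```

Under these assumptions, the rare sentences would be the ones passed to a text generation model to produce additional examples of scarce entities, while the cutoff value would need to be tuned per corpus.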
