Paper title

Heaps' law and Heaps functions in tagged texts: Evidences of their linguistic relevance

Paper authors

Andrés Chacoma, Damián H. Zanette

Abstract

We study the relationship between vocabulary size and text length in a corpus of $75$ literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or ``tags,'' namely, {\it nouns}, {\it verbs}, and {\it others}), and analyze the progressive appearance of new words of each tag along each individual text. While the power-law relation prescribed by Heaps' law is satisfactorily fulfilled by total vocabulary sizes and text lengths, the appearance of new words in each text is on the whole well described by the average of random shufflings of the text, which does not obey a power law. Deviations from this average, however, are statistically significant and show a systematic trend across the corpus. Specifically, they reveal that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags are shown to add systematically distinct contributions to this tendency, with {\it verbs} and {\it others} being respectively more and less retarded than the mean trend, and {\it nouns} following instead this overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' law, a feature that is still in need of extensive assessment.
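
The comparison described in the abstract, between the vocabulary growth of a text and the average over random shufflings of that text, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`heaps_curve`, `tagged_heaps_curve`, `shuffled_average`), the representation of a text as a list of `(word, tag)` pairs, the number of shufflings, and the choice to index the per-tag curve by position in the whole text are all assumptions made for illustration.

```python
import random


def heaps_curve(words):
    """Number of distinct words among the first n words, for n = 1..N."""
    seen = set()
    curve = []
    for word in words:
        seen.add(word)
        curve.append(len(seen))
    return curve


def tagged_heaps_curve(tagged_words, tag):
    """Distinct words carrying `tag` seen up to each position of the text.

    `tagged_words` is a list of (word, tag) pairs; this layout is an
    assumption for illustration, not a format taken from the paper.
    """
    seen = set()
    curve = []
    for word, word_tag in tagged_words:
        if word_tag == tag:
            seen.add(word)
        curve.append(len(seen))
    return curve


def shuffled_average(items, curve_fn, n_shuffles=200, seed=0):
    """Monte Carlo average of curve_fn over random shufflings of items."""
    rng = random.Random(seed)
    work = list(items)  # copy, so the caller's list is left untouched
    totals = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, value in enumerate(curve_fn(work)):
            totals[i] += value
    return [total / n_shuffles for total in totals]


if __name__ == "__main__":
    # Toy text (hypothetical data): every new word type appears late.
    text = ["a"] * 5 + ["b"] * 5
    real = heaps_curve(text)
    baseline = shuffled_average(text, heaps_curve)
    # Negative entries mean fewer distinct words so far than in the
    # shuffled baseline, i.e. new words appear later than expected.
    deviation = [r - b for r, b in zip(real, baseline)]
    print(deviation)
```

For a per-tag curve, the same baseline can be obtained by shuffling the `(word, tag)` pairs, for example `shuffled_average(tagged, lambda t: tagged_heaps_curve(t, "noun"))`, where `tagged` and the label `"noun"` are again hypothetical. Judging whether a deviation is statistically significant, as the abstract reports, would additionally require the spread across shufflings, which this sketch does not compute.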
