Paper title

The empirical structure of word frequency distributions

Paper author

Ramscar, Michael

Abstract

The frequencies at which individual words occur across languages follow power-law distributions, a pattern of findings known as Zipf's law. A vast literature argues over whether this serves to optimize the efficiency of human communication; however, this claim is necessarily post hoc, and it has been suggested that Zipf's law may in fact describe mixtures of other distributions. From this perspective, recent findings that Sinosphere first (family) names are geometrically distributed are notable, because this is actually consistent with information-theoretic predictions regarding optimal coding. First names form natural communicative distributions in most languages, and I show that when analyzed in relation to the communities in which they are used, first name distributions across a diverse set of languages are both geometric and, historically, remarkably similar, with power-law distributions emerging only when empirical distributions are aggregated. I then show that this pattern of findings replicates in communicative distributions of English nouns and verbs. These results indicate that if lexical distributions support efficient communication, they do so because their functional structures directly satisfy the constraints described by information theory, and not because of Zipf's law. Understanding the function of these information structures is likely to be key to explaining humankind's remarkable communicative capacities.
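
The claim that power laws can emerge from aggregated geometric distributions can be illustrated with a small numerical sketch. This is not the paper's analysis or data. It is a toy mixture under stated assumptions: 50 hypothetical communities, each with a geometric rank-frequency distribution whose parameter `q` is log-spaced between `1e-4` and `0.5`, averaged rank by rank. All names and parameter values below are illustrative choices. A geometric distribution is a straight line on semi-log axes (log frequency against rank), whereas a power law is a straight line on log-log axes, so the sketch compares the R^2 of both straight-line fits.

```python
import math

def geometric_pmf(q, n_ranks):
    """P(rank = r) = q * (1 - q)**(r - 1) for r = 1..n_ranks (truncated, not renormalized)."""
    return [q * (1.0 - q) ** (r - 1) for r in range(1, n_ranks + 1)]

def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a straight-line fit of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def linearity(pmf):
    """Return (semi-log R^2, log-log R^2) for a rank-frequency list."""
    ranks = list(range(1, len(pmf) + 1))
    log_p = [math.log(p) for p in pmf]
    semilog = r_squared(ranks, log_p)                         # geometric => straight line
    loglog = r_squared([math.log(r) for r in ranks], log_p)   # power law => straight line
    return semilog, loglog

# Illustrative assumptions, not values taken from the paper.
N_RANKS = 1000
N_COMMUNITIES = 50
Q_MIN, Q_MAX = 1e-4, 0.5

# Log-spaced geometric parameters, one per hypothetical community.
qs = [Q_MIN * (Q_MAX / Q_MIN) ** (k / (N_COMMUNITIES - 1))
      for k in range(N_COMMUNITIES)]

# A single community: geometric in rank.
single = geometric_pmf(0.01, N_RANKS)

# The aggregate: rank-by-rank average over all communities.
community_pmfs = [geometric_pmf(q, N_RANKS) for q in qs]
aggregate = [sum(column) / N_COMMUNITIES for column in zip(*community_pmfs)]

single_semilog, single_loglog = linearity(single)
agg_semilog, agg_loglog = linearity(aggregate)

print(f"single community: semi-log R^2 = {single_semilog:.4f}, log-log R^2 = {single_loglog:.4f}")
print(f"aggregate:        semi-log R^2 = {agg_semilog:.4f}, log-log R^2 = {agg_loglog:.4f}")
```

Under these assumptions, the single community should be fitted almost perfectly on semi-log axes, while the aggregate should be better described by a straight line on log-log axes. Averaging rank by rank is a simplification: in real data the same word or name can hold different ranks in different communities, which this sketch does not model.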
