Paper title
Bias in Zipf's Law Estimators
Paper authors
Paper abstract
The prevailing maximum likelihood estimators for inferring power law models from rank-frequency data are biased. The source of this bias is an inappropriate likelihood function. The correct likelihood function is derived and shown to be computationally intractable. A more computationally efficient method of approximate Bayesian computation (ABC) is explored. This method is shown to have less bias for data generated from idealised rank-frequency Zipfian distributions. However, the existing estimators and the ABC estimator described here assume that words are drawn from a simple probability distribution, while language is a much more complex process. We show that this false assumption leads to continued biases when applying any of these methods to natural language to estimate Zipf exponents. We recommend that researchers be aware of these biases when investigating power laws in rank-frequency data.
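To make the setting concrete, the following is a minimal sketch, not the authors' code, of one common form of maximum likelihood fit to rank-frequency data. It assumes that each word's empirical rank is treated as if it were its true rank and that the distribution is truncated at the number of observed word types; both are illustrative assumptions, as are all function names and parameter values. It samples tokens from an idealised Zipfian distribution with a known exponent, ranks the resulting counts, and fits the exponent by golden-section search so the estimate can be compared with the true value.

```python
import math
import random
from collections import Counter


def zipf_probs(alpha, n_types):
    """Normalised probabilities p(r) proportional to r**(-alpha), r = 1..n_types."""
    weights = [r ** -alpha for r in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_rank_frequencies(alpha, n_types, n_tokens, rng):
    """Draw tokens from the idealised distribution and return their counts
    sorted in decreasing order, i.e. indexed by empirical rank."""
    probs = zipf_probs(alpha, n_types)
    draws = rng.choices(range(n_types), weights=probs, k=n_tokens)
    return sorted(Counter(draws).values(), reverse=True)


def log_likelihood(alpha, counts):
    """Log-likelihood that treats empirical rank r as the true rank
    (the simplifying assumption of this sketch)."""
    n_tokens = sum(counts)
    log_norm = math.log(sum(r ** -alpha for r in range(1, len(counts) + 1)))
    weighted_log_ranks = sum(c * math.log(r) for r, c in enumerate(counts, start=1))
    return -alpha * weighted_log_ranks - n_tokens * log_norm


def fit_alpha(counts, lo=0.01, hi=5.0, tol=1e-5):
    """Maximise the (concave) log-likelihood over alpha by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = log_likelihood(c, counts), log_likelihood(d, counts)
    while b - a > tol:
        if fc > fd:  # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = log_likelihood(c, counts)
        else:        # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = log_likelihood(d, counts)
    return (a + b) / 2


if __name__ == "__main__":
    rng = random.Random(0)
    true_alpha = 1.2  # hypothetical value chosen for illustration
    estimates = [
        fit_alpha(sample_rank_frequencies(true_alpha, 1000, 5000, rng))
        for _ in range(20)
    ]
    print("true exponent:", true_alpha)
    print("mean estimate:", sum(estimates) / len(estimates))
```

Repeating the simulation over many samples and comparing the mean estimate with the true exponent gives an empirical view of the estimator's behaviour on idealised data; natural-language text would violate the independent-sampling assumption this sketch makes.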