Title
Estimating the Entropy of Linguistic Distributions
Authors
Abstract
Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution that gives rise to these data. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. Finally, we end our paper with concrete recommendations for entropy estimation depending on distribution type and data availability.
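To make the estimation problem concrete, here is a minimal Python sketch (not from the paper; function names are illustrative) contrasting the naive maximum-likelihood ("plug-in") entropy estimator, which is known to be negatively biased on small samples, with the classic Miller-Madow bias correction:

```python
import math
from collections import Counter

def plugin_entropy(samples):
    """Maximum-likelihood ("plug-in") entropy estimate in bits.
    Systematically under-estimates entropy for small sample sizes."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def miller_madow_entropy(samples):
    """Miller-Madow correction: adds (K - 1) / (2n) nats (converted
    to bits here), where K is the number of observed symbol types."""
    n = len(samples)
    k = len(set(samples))
    return plugin_entropy(samples) + (k - 1) / (2 * n * math.log(2))

# Example: a small sample from a uniform 4-symbol distribution,
# whose true entropy is exactly 2 bits.
sample = ["a", "b", "a", "c", "d", "a", "b", "c"]
print(plugin_entropy(sample))        # under-estimates the true entropy
print(miller_madow_entropy(sample))  # partially corrects the downward bias
```

The plug-in estimate here is about 1.91 bits, below the true 2 bits; the correction moves the estimate back toward the truth. The paper's point is that which estimator you choose (plug-in, Miller-Madow, or more sophisticated alternatives) can materially change reported effect sizes.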