Paper title
Wikipedia Reader Navigation: When Synthetic Data Is Enough
Paper authors
Paper abstract
Every day millions of people read Wikipedia. When navigating the vast space of available topics using hyperlinks, readers describe trajectories on the article network. Understanding these navigation patterns is crucial to better serve readers' needs and address structural biases and knowledge gaps. However, systematic studies of navigation on Wikipedia are hindered by a lack of publicly available data due to the commitment to protect readers' privacy by not storing or sharing potentially sensitive data. In this paper, we ask: How well can Wikipedia readers' navigation be approximated by using publicly available resources, most notably the Wikipedia clickstream data? We systematically quantify the differences between real navigation sequences and synthetic sequences generated from the clickstream data, in 6 analyses across 8 Wikipedia language versions. Overall, we find that the differences between real and synthetic sequences are statistically significant, but with small effect sizes, often well below 10%. This constitutes quantitative evidence for the utility of the Wikipedia clickstream data as a public resource: clickstream data can closely capture reader navigation on Wikipedia and provides a sufficient approximation for most practical downstream applications relying on reader data. More broadly, this study provides an example of how clickstream-like data can generally enable research on user navigation on online platforms while protecting users' privacy.