Paper Title
Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts
Paper Authors
Abstract
A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences. Through several case studies, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.
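To make the weighted-average idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest case, in which each word has one fixed score shared by both texts, so a text's measure is the frequency-weighted average of its word scores and the difference between two texts splits into one term per word. The function names, the toy token lists, and the sentiment-like scores are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def relative_freqs(tokens):
    """Return each word's relative frequency in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def weighted_average(freqs, scores):
    """Frequency-weighted average score of a text.

    Assumes every word in `freqs` has an entry in `scores`.
    """
    return sum(p * scores[word] for word, p in freqs.items())


def word_contributions(tokens_1, tokens_2, scores):
    """Per-word contribution to (average of text 2) - (average of text 1).

    With one fixed score per word (an assumption of this sketch), the
    contribution of a word is its change in relative frequency times its score.
    """
    p1 = relative_freqs(tokens_1)
    p2 = relative_freqs(tokens_2)
    vocab = set(p1) | set(p2)
    return {w: (p2.get(w, 0.0) - p1.get(w, 0.0)) * scores[w] for w in vocab}


# Hypothetical toy data: two tiny "texts" and made-up word scores.
text_1 = ["happy", "happy", "sad", "day"]
text_2 = ["happy", "sad", "sad", "day"]
scores = {"happy": 8.0, "sad": 2.0, "day": 5.0}

avg_1 = weighted_average(relative_freqs(text_1), scores)
avg_2 = weighted_average(relative_freqs(text_2), scores)
contributions = word_contributions(text_1, text_2, scores)

# Rank words by the magnitude of their contribution, as a word shift
# graph would when drawing its bars.
ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
for word, value in ranked:
    print(f"{word:>6}: {value:+.2f}")
print(f"total difference: {avg_2 - avg_1:+.2f}")
```

The per-word contributions sum to the overall difference between the two averages, which is what lets a single number be unpacked into a ranked list of words. Measures whose word scores differ between the two texts, such as the entropy-based divergences mentioned in the abstract, need a more general decomposition than this sketch covers.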