Paper Title
SciLander: Mapping the Scientific News Landscape
Paper Authors
Paper Abstract
The COVID-19 pandemic has fueled the spread of misinformation on social media and the Web as a whole. The phenomenon dubbed `infodemic' has taken the challenges of information veracity and trust to new heights by massively introducing seemingly scientific and technical elements into misleading content. Despite the existing body of work on modeling and predicting misinformation, the coverage of very complex scientific topics with inherent uncertainty and an evolving set of findings, such as COVID-19, provides many new challenges that are not easily solved by existing tools. To address these issues, we introduce SciLander, a method for learning representations of news sources reporting on science-based topics. SciLander extracts four heterogeneous indicators for the news sources; two generic indicators that capture (1) the copying of news stories between sources, and (2) the use of the same terms to mean different things (i.e., the semantic shift of terms), and two scientific indicators that capture (1) the usage of jargon and (2) the stance towards specific citations. We use these indicators as signals of source agreement, sampling pairs of positive (similar) and negative (dissimilar) samples, and combine them in a unified framework to train unsupervised news source embeddings with a triplet margin loss objective. We evaluate our method on a novel COVID-19 dataset containing nearly 1M news articles from 500 sources spanning a period of 18 months since the beginning of the pandemic in 2020. Our results show that the features learned by our model outperform state-of-the-art baseline methods on the task of news veracity classification. Furthermore, a clustering analysis suggests that the learned representations encode information about the reliability, political leaning, and partisanship bias of these sources.
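To make the training setup described in the abstract more concrete, the following is a minimal sketch, not the authors' released code, of how news-source embeddings could be learned with a triplet margin loss once positive (agreeing) and negative (disagreeing) source pairs have been derived from the four indicators. It assumes PyTorch, and the names NUM_SOURCES, EMBED_DIM, positives, negatives, and sample_triplets are illustrative placeholders rather than anything specified in the paper.

```python
# Minimal sketch: unsupervised news-source embeddings trained with a
# triplet margin loss. Positive/negative pairs are assumed to come from the
# copying, semantic-shift, jargon, and citation-stance agreement indicators.
import random
import torch
import torch.nn as nn

NUM_SOURCES = 500   # number of news sources in the corpus (as in the paper's dataset)
EMBED_DIM = 128     # illustrative embedding dimensionality
MARGIN = 1.0        # illustrative triplet-loss margin

# Hypothetical pre-computed agreement lists: for each source index, the sources
# it agrees with (positives) and disagrees with (negatives). Dummy values here.
positives = {s: [(s + 1) % NUM_SOURCES] for s in range(NUM_SOURCES)}
negatives = {s: [(s + 250) % NUM_SOURCES] for s in range(NUM_SOURCES)}

embeddings = nn.Embedding(NUM_SOURCES, EMBED_DIM)
criterion = nn.TripletMarginLoss(margin=MARGIN)
optimizer = torch.optim.Adam(embeddings.parameters(), lr=1e-3)

def sample_triplets(batch_size=64):
    """Sample (anchor, positive, negative) source-index triplets."""
    anchors = random.choices(range(NUM_SOURCES), k=batch_size)
    pos = [random.choice(positives[a]) for a in anchors]
    neg = [random.choice(negatives[a]) for a in anchors]
    return torch.tensor(anchors), torch.tensor(pos), torch.tensor(neg)

for step in range(1000):
    a_idx, p_idx, n_idx = sample_triplets()
    loss = criterion(embeddings(a_idx), embeddings(p_idx), embeddings(n_idx))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The learned rows of embeddings.weight can then be passed to a downstream
# veracity classifier or to a clustering algorithm, mirroring the evaluation
# described in the abstract.
```

In this sketch, pulling an anchor source closer to sources it agrees with and pushing it away from sources it disagrees with is what encodes the agreement signals in the embedding space; the specific sampling strategy and hyperparameters above are assumptions, not the paper's reported configuration.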