Paper title
Words are all you need? Language as an approximation for human similarity judgments
Paper authors
Paper abstract
Human similarity judgments are a powerful supervision signal for machine learning applications based on techniques such as contrastive learning, information retrieval, and model alignment, but classical methods for collecting human similarity judgments are too expensive to be used at scale. Recent methods propose using pre-trained deep neural networks (DNNs) to approximate human similarity, but pre-trained DNNs may not be available for certain domains (e.g., medical images, low-resource languages) and their performance in approximating human similarity has not been extensively tested. We conducted an evaluation of 611 pre-trained models across three domains -- images, audio, and video -- and found that there is a large gap in performance between human similarity judgments and pre-trained DNNs. To address this gap, we propose a new class of similarity approximation methods based on language. To collect the language data required by these new methods, we also developed and validated a novel adaptive tag collection pipeline. We find that our proposed language-based methods are significantly cheaper than classical methods in terms of the number of human judgments required, yet still improve performance over the DNN-based methods. Finally, we also develop "stacked" methods that combine language embeddings with DNN embeddings, and find that these consistently provide the best approximations of human similarity across all three of our modalities. Based on the results of this comprehensive study, we provide a concise guide for researchers interested in collecting or approximating human similarity data. To accompany this guide, we also release all of the similarity and language data that we collected in our experiments, a total of 206,339 human judgments, along with a detailed breakdown of all modeling results.
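The abstract describes three families of similarity approximations -- pre-trained DNN embeddings, language-based embeddings derived from captions or tags, and "stacked" combinations of the two -- each scored against human similarity judgments. The sketch below illustrates one plausible way to compute and compare such approximations; the placeholder data, embedding dimensions, concatenation-based stacking, and Spearman-correlation scoring are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical placeholder data: in practice these would be real DNN features,
# sentence embeddings of collected captions/tags, and pairwise human similarity
# judgments for the same set of stimuli.
n_items = 50
dnn_emb = rng.normal(size=(n_items, 512))    # e.g., vision/audio/video model features
lang_emb = rng.normal(size=(n_items, 384))   # e.g., text embeddings of captions or tags
human_sim = rng.uniform(size=n_items * (n_items - 1) // 2)  # condensed pairwise vector


def pairwise_similarity(emb):
    """Cosine similarity for every item pair, in condensed (pdist) order."""
    return 1.0 - pdist(emb, metric="cosine")


def standardize(x):
    """Standardize each embedding dimension to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


# 1) DNN-only and language-only approximations of human similarity.
sim_dnn = pairwise_similarity(dnn_emb)
sim_lang = pairwise_similarity(lang_emb)

# 2) A simple "stacked" approximation: concatenate the standardized embedding
#    spaces before computing pairwise similarity.
stacked_emb = np.hstack([standardize(dnn_emb), standardize(lang_emb)])
sim_stacked = pairwise_similarity(stacked_emb)

# 3) Score each approximation against human judgments via rank correlation.
for name, sim in [("DNN", sim_dnn), ("language", sim_lang), ("stacked", sim_stacked)]:
    rho, _ = spearmanr(sim, human_sim)
    print(f"{name:>8} embeddings vs. human similarity: Spearman rho = {rho:.3f}")
```

With real embeddings and judgments substituted for the placeholders, the same three correlations give a rough, illustrative analogue of the comparison the abstract reports between DNN-based, language-based, and stacked approximations.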