Paper Title
Visual representations in the human brain are aligned with large language models
Paper Authors
Paper Abstract
The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative approach for studying this information remains elusive. Here, we test whether the contextual information encoded in large language models (LLMs) is beneficial for modelling the complex visual information extracted by the brain from natural scenes. We show that LLM embeddings of scene captions successfully characterise brain activity evoked by viewing the natural scenes. This mapping captures selectivities of different brain areas, and is sufficiently robust that accurate scene captions can be reconstructed from brain activity. Using carefully controlled model comparisons, we then proceed to show that the accuracy with which LLM representations match brain representations derives from the ability of LLMs to integrate complex information contained in scene captions beyond that conveyed by individual words. Finally, we train deep neural network models to transform image inputs into LLM representations. Remarkably, these networks learn representations that are better aligned with brain representations than a large number of state-of-the-art alternative models, despite being trained on orders-of-magnitude less data. Overall, our results suggest that LLM embeddings of scene captions provide a representational format that accounts for complex information extracted by the brain from visual inputs.
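To make the central mapping concrete, below is a minimal sketch of the kind of encoding model the abstract describes: scene captions are embedded with a pretrained sentence encoder and regressed onto voxel responses with cross-validated ridge regression, then evaluated on held-out scenes. The specific encoder (`all-mpnet-base-v2`), the ridge solver, and all variable names are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch of a caption-embedding encoding model (not the paper's
# actual pipeline). Assumes that, for each scene, you have a text caption and
# a vector of evoked voxel responses from an fMRI dataset.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def fit_caption_encoding_model(captions, voxel_responses, seed=0):
    """Map caption embeddings to brain activity with ridge regression.

    captions        : list[str], one caption per scene
    voxel_responses : array (n_scenes, n_voxels) of evoked responses
    Returns the fitted regressor and per-voxel test correlations.
    """
    # 1. Embed each scene caption with a pretrained sentence encoder.
    encoder = SentenceTransformer("all-mpnet-base-v2")  # hypothetical choice
    X = encoder.encode(captions)                        # (n_scenes, emb_dim)

    # 2. Hold out a set of scenes so evaluation is on unseen images.
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, voxel_responses, test_size=0.2, random_state=seed)

    # 3. Fit a cross-validated ridge regression from embeddings to voxels.
    reg = RidgeCV(alphas=np.logspace(-2, 5, 8))
    reg.fit(X_tr, Y_tr)

    # 4. Score: Pearson correlation between predicted and measured responses,
    #    computed independently for every voxel on the held-out scenes.
    Y_hat = reg.predict(X_te)
    Y_hat_c = Y_hat - Y_hat.mean(axis=0)
    Y_te_c = Y_te - Y_te.mean(axis=0)
    r = (Y_hat_c * Y_te_c).sum(axis=0) / (
        np.linalg.norm(Y_hat_c, axis=0) * np.linalg.norm(Y_te_c, axis=0) + 1e-12)
    return reg, r
```

The abstract's further analyses operate in the same representational space: inverting this kind of mapping supports caption reconstruction from brain activity, and the deep networks mentioned at the end are trained to predict such caption embeddings directly from image inputs.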