Paper Title

Cognitive Accident Prediction in Driving Scenes: A Multimodality Benchmark

Paper Authors

Jianwu Fang, Lei-Lei Li, Kuan Yang, Zhedong Zheng, Jianru Xue, Tat-Seng Chua

Abstract

Traffic accident prediction in driving videos aims to provide an early warning of accident occurrence and to support the decision making of safe driving systems. Previous works usually concentrate on the spatial-temporal correlation of object-level context, but they do not fit the inherently long-tailed data distribution well and are vulnerable to severe environmental changes. In this work, we propose a Cognitive Accident Prediction (CAP) method that explicitly leverages human-inspired cognition, in the form of text descriptions of the visual observations and driver attention, to facilitate model training. In particular, the text description provides dense semantic guidance for the primary context of the traffic scene, while the driver attention provides traction for focusing on the critical regions closely correlated with safe driving. CAP is formulated with an attentive text-to-vision shift fusion module, an attentive scene context transfer module, and a driver attention guided accident prediction module. We leverage the attention mechanism in these modules to explore the core semantic cues for accident prediction. To train CAP, we extend the existing self-collected DADA-2000 dataset (annotated with driver attention for each frame) with factual text descriptions of the visual observations before the accidents. In addition, we construct a new large-scale benchmark consisting of 11,727 in-the-wild accident videos with over 2.19 million frames (named CAP-DATA), together with labeled fact-effect-reason-introspection descriptions and temporal accident frame labels. Extensive experiments validate the superiority of CAP over state-of-the-art approaches. The code, CAP-DATA, and all results will be released at \url{https://github.com/JWFanggit/LOTVS-CAP}.
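The abstract describes CAP as a composition of three attention-based modules: text-to-vision shift fusion, scene context transfer, and driver attention guided prediction. The following PyTorch sketch is only a hypothetical illustration of how such a pipeline could be wired together; the class name, feature dimensions, GRU-based temporal transfer, and sigmoid gating are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a three-module CAP-style pipeline (not the authors' code).
import torch
import torch.nn as nn


class CAPSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Text-to-vision fusion (assumed): frame features attend to text tokens.
        self.text_to_vision = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Scene context transfer (assumed): propagate fused context over time.
        self.scene_context = nn.GRU(d_model, d_model, batch_first=True)
        # Driver attention guidance (assumed): a learned gate re-weights the
        # temporal context before per-frame accident scoring.
        self.attn_gate_head = nn.Linear(d_model, 1)
        self.accident_head = nn.Linear(d_model, 1)

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, T, d_model) per-frame visual features
        # text_feats:  (B, L, d_model) embedded description tokens
        fused, _ = self.text_to_vision(frame_feats, text_feats, text_feats)
        context, _ = self.scene_context(fused)
        gate = torch.sigmoid(self.attn_gate_head(context))      # (B, T, 1)
        score = torch.sigmoid(self.accident_head(gate * context))
        return score.squeeze(-1)                                 # (B, T) accident probability


if __name__ == "__main__":
    model = CAPSketch()
    frames = torch.randn(2, 30, 256)   # 2 clips, 30 frames each
    text = torch.randn(2, 16, 256)     # 16 description tokens per clip
    print(model(frames, text).shape)   # torch.Size([2, 30])
```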
