Paper Title


Unifying Tracking and Image-Video Object Detection

Paper Authors

Peirong Liu, Rui Wang, Pengchuan Zhang, Omid Poursaeed, Yipin Zhou, Xuefei Cao, Sreya Dutta Roy, Ashish Shah, Ser-Nam Lim

Paper Abstract


Object detection (OD) has been one of the most fundamental tasks in computer vision. Recent developments in deep learning have pushed the performance of image OD to new heights through learning-based, data-driven approaches. Video OD, on the other hand, remains less explored, mostly due to its much more expensive data annotation needs. Meanwhile, multi-object tracking (MOT), which requires reasoning about track identities and spatio-temporal trajectories, shares a similar spirit with video OD. However, most MOT datasets are class-specific (e.g., person-annotated only), which constrains a model's flexibility to perform tracking on other objects. We propose TrIVD (Tracking and Image-Video Detection), the first framework that unifies image OD, video OD, and MOT within one end-to-end model. To handle the discrepancies and semantic overlaps of category labels across datasets, TrIVD formulates detection/tracking as grounding and reasons about object categories via visual-text alignments. The unified formulation enables cross-dataset, multi-task training, and thus equips TrIVD with the ability to leverage frame-level features, video-level spatio-temporal relations, and track identity associations. With such joint training, we can now extend the knowledge from OD data, which comes with much richer object category annotations, to MOT and achieve zero-shot tracking capability. Experiments demonstrate that the multi-task co-trained TrIVD outperforms single-task baselines across all image/video OD and MOT tasks. We further set the first baseline on the new task of zero-shot tracking.
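The "detection as grounding" idea described in the abstract — classifying detected regions by matching their visual features against text embeddings of category phrases in a shared space — can be sketched as follows. This is a minimal illustration of the general mechanism only; the function name, embedding dimensions, and the cosine-similarity scoring are assumptions for the sketch, not TrIVD's actual implementation.

```python
import numpy as np

def ground_regions(region_feats, text_feats):
    """Score each region proposal against each category phrase by cosine
    similarity in a shared embedding space. Rows of `region_feats` are
    visual embeddings of detected regions; rows of `text_feats` are text
    embeddings of category names/phrases (dimensions are illustrative)."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    # (num_regions, num_categories) alignment scores in [-1, 1]
    return r @ t.T

# Toy example: 2 regions, 3 category phrases, 4-d embeddings.
rng = np.random.default_rng(0)
regions = rng.normal(size=(2, 4))
texts = rng.normal(size=(3, 4))
scores = ground_regions(regions, texts)
labels = scores.argmax(axis=1)  # best-matching category per region
```

Because categories are represented as text rather than as a fixed classifier head, label sets from different datasets can be merged at training time, which is what enables the cross-dataset training and zero-shot tracking the abstract describes.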
