Paper Title

Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

Paper Authors

Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, Huijuan Xu

Paper Abstract

Weakly-supervised action localization requires training a model to localize action segments in a video given only a video-level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is identifying which key instances within the bag trigger the bag's label. Most previous models take attention-based approaches, applying attention to generate the bag's representation from its instances and then training it via bag-level classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instance assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M steps and iteratively optimize a lower bound on the likelihood. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions, and it achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet1.2.
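The E/M alternation the abstract describes can be sketched on a toy MIL problem. This is a minimal illustration, not the paper's actual model: the synthetic bag data, the 1-D logistic instance classifier, and the mean-score thresholding rule in the E-step are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: each bag holds 8 scalar instance "features"; a positive bag
# contains a few key instances drawn from a shifted distribution.
def make_bag(positive):
    x = rng.normal(0.0, 1.0, 8)
    if positive:
        k = int(rng.integers(1, 4))  # 1-3 key instances per positive bag
        x[:k] += 3.0
    return x, int(positive)

bags = [make_bag(i % 2 == 0) for i in range(40)]

w, b = 0.0, 0.0  # a 1-D logistic instance classifier

for _ in range(20):
    # E-step: pseudo-label instances given current scores. In a positive
    # bag, instances scoring at or above the bag's mean score are marked
    # pseudo-positive; every instance in a negative bag is negative,
    # matching the MIL assumption. (This thresholding rule is our own
    # illustrative choice, not the paper's exact scheme.)
    data = []
    for x, y in bags:
        s = sigmoid(w * x + b)
        pl = (s >= s.mean()).astype(float) if y == 1 else np.zeros_like(x)
        data.append((x, pl))

    # M-step: refit the classifier to the pseudo-labels with gradient
    # steps on the logistic (cross-entropy) loss.
    for _ in range(50):
        gw = gb = 0.0
        n = 0
        for x, pl in data:
            p = sigmoid(w * x + b)
            gw += float(np.sum((p - pl) * x))
            gb += float(np.sum(p - pl))
            n += len(x)
        w -= 0.5 * gw / n
        b -= 0.5 * gb / n

# Key instances in positive bags should now score well above negatives.
avg_pos_top = np.mean([sigmoid(w * x + b).max() for x, y in bags if y == 1])
avg_neg = np.mean([sigmoid(w * x + b).mean() for x, y in bags if y == 0])
print(round(float(avg_pos_top), 2), round(float(avg_neg), 2))
```

The alternation matters because each E-step re-estimates the hidden key-instance assignment from the current classifier, and each M-step improves the classifier under that assignment, jointly pushing up a lower bound on the bag-label likelihood.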
