Paper Title
P2ANet: A Dataset and Benchmark for Dense Action Detection from Table Tennis Match Broadcasting Videos
Paper Authors
Paper Abstract
While deep learning has been widely used for video analytics, such as video classification and action detection, dense action detection with fast-moving subjects in sports videos remains challenging. In this work, we release yet another sports video benchmark \TheName{} for \emph{\underline{P}}ing \emph{\underline{P}}ong-\emph{\underline{A}}ction detection, which consists of 2,721 video clips collected from the broadcasting videos of professional table tennis matches in World Table Tennis Championships and Olympiads. We work with a crew of table tennis professionals and referees, using a specially designed annotation toolbox, to obtain fine-grained action labels (in 14 classes) for every ping-pong action that appears in the dataset, and formulate two sets of action detection problems -- \emph{action localization} and \emph{action recognition}. We evaluate a number of commonly used action recognition models (e.g., TSM, TSN, Video Swin Transformer, and SlowFast) and action localization models (e.g., BSN, BSN++, BMN, TCANet) on \TheName{} for both problems, under various settings. These models achieve only 48\% area under the AR-AN curve for localization and 82\% top-1 accuracy for recognition, since ping-pong actions are dense and involve fast-moving subjects while the broadcast videos run at only 25 FPS. The results confirm that \TheName{} remains a challenging task and can serve as a special benchmark for dense action detection from videos.