Audeo: Audio Generation for a Silent Performance Video

Paper Title

Audeo: Audio Generation for a Silent Performance Video

Authors

Kun Su, Xiulong Liu, Eli Shlizerman

Abstract

We present a novel system that takes as input video frames of a musician playing the piano and generates the music for that video. Generating music from visual cues is a challenging problem, and it is not clear whether it is an attainable goal at all. Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events. To achieve the transformation, we built a full pipeline named Audeo containing three components. We first translate the video frames of the keyboard and the musician's hand movements into a raw mechanical symbolic music representation, the Piano-Roll (Roll), which records for each video frame the keys pressed at that time step. We then adapt the Roll to be amenable to audio synthesis by including temporal correlations. This step turns out to be critical for meaningful audio generation. As a last step, we implement MIDI synthesizers to generate realistic music. Audeo converts video to audio smoothly and clearly with only a few setup constraints. We evaluate Audeo on "in the wild" piano performance videos and find that the generated music is of reasonable audio quality and can be successfully recognized with high precision by popular music identification software.
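To make the Piano-Roll representation concrete, the following is a minimal sketch of how a frame-level binary roll could be turned into note events of the kind a MIDI synthesizer consumes. It is not the authors' implementation: the function name `roll_to_notes`, the 25 fps frame rate, and the 88-key layout starting at MIDI pitch 21 (A0) are assumptions made for illustration. It simply groups consecutive pressed frames into notes and does not model the temporal-correlation step described in the abstract.

```python
def roll_to_notes(roll, fps=25.0, lowest_midi=21):
    """Group consecutive pressed frames of a binary piano roll into notes.

    roll: list of frames, each a list of 0/1 flags, one flag per key.
    fps: assumed video frame rate (hypothetical; not stated in the abstract).
    lowest_midi: MIDI pitch of the first key (21 = A0 on an 88-key piano).
    Returns a list of (midi_pitch, onset_seconds, offset_seconds) tuples.
    """
    notes = []
    n_keys = len(roll[0]) if roll else 0
    for key in range(n_keys):
        onset = None  # frame index where the current note started, if any
        for t, frame in enumerate(roll):
            pressed = bool(frame[key])
            if pressed and onset is None:
                onset = t
            elif not pressed and onset is not None:
                notes.append((lowest_midi + key, onset / fps, t / fps))
                onset = None
        if onset is not None:
            # Key still held in the last frame: close the note at the clip end.
            notes.append((lowest_midi + key, onset / fps, len(roll) / fps))
    notes.sort(key=lambda n: (n[1], n[0]))
    return notes


# Usage: five frames, with key index 39 (middle C, MIDI 60) held in frames 1-3.
example_roll = [[0] * 88 for _ in range(5)]
for t in (1, 2, 3):
    example_roll[t][39] = 1
example_notes = roll_to_notes(example_roll)
```

A naive grouping like this treats every gap of a single frame as a new note onset, which is one reason a raw per-frame roll may not be directly suitable for synthesis.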
