Paper Title
LidarCLIP or: How I Learned to Talk to Point Clouds
Paper Authors
Paper Abstract
Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models are available at https://github.com/atonderski/lidarclip.
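The abstract describes distilling frozen CLIP image embeddings into a point cloud encoder using paired image-lidar data, so that lidar embeddings land in the existing CLIP space and can be compared directly with text embeddings. Below is a minimal sketch of that idea in PyTorch. The SimplePointEncoder architecture, the MSE distillation loss, and the training-step details are illustrative assumptions, not the authors' exact setup; see the linked repository for the real implementation.

# Minimal sketch of the LidarCLIP training idea: a point cloud encoder is
# supervised to match the frozen CLIP image embedding of the paired camera image.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package


class SimplePointEncoder(nn.Module):
    """Toy point cloud encoder (hypothetical stand-in for the real lidar backbone)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 256))
        self.head = nn.Linear(256, embed_dim)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) -> per-point features -> global max pool
        feats = self.point_mlp(points).max(dim=1).values
        return self.head(feats)


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP stays frozen; only the lidar encoder is trained

lidar_encoder = SimplePointEncoder(embed_dim=clip_model.visual.output_dim).to(device)
optimizer = torch.optim.Adam(lidar_encoder.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # a cosine-similarity loss would be an equally plausible choice


def training_step(images: torch.Tensor, point_clouds: torch.Tensor) -> float:
    """One distillation step on a batch of paired (image, point cloud) samples."""
    with torch.no_grad():
        target = clip_model.encode_image(images.to(device)).float()  # frozen teacher
    pred = lidar_encoder(point_clouds.to(device))
    loss = loss_fn(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def retrieve(prompt: str, point_clouds: torch.Tensor) -> torch.Tensor:
    """Rank point clouds by cosine similarity to a text prompt (zero-shot retrieval)."""
    text = clip.tokenize([prompt]).to(device)
    text_emb = clip_model.encode_text(text).float()
    lidar_emb = lidar_encoder(point_clouds.to(device))
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    lidar_emb = lidar_emb / lidar_emb.norm(dim=-1, keepdim=True)
    return (lidar_emb @ text_emb.T).squeeze(-1)  # one similarity score per point cloud

Because the lidar embeddings share CLIP's space, the same frozen text encoder used in `retrieve` can also serve zero-shot classification (one prompt per class) or be combined with image embeddings for the joint retrieval described in the abstract.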