Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures

Paper Title

Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures

Authors

Noever, David; Regian, Wes; Ciolino, Matt; Kalin, Josh; Hambrick, Dom; Blankenship, Kaye

Abstract

Small satellite constellations provide daily global coverage of the earth's landmass, but image enrichment relies on automating key tasks like change detection or feature search. For example, extracting text annotations from raw pixels requires two dependent machine learning models: one to analyze the overhead image and another to generate a descriptive caption. We evaluate seven models on what was previously the largest benchmark for satellite image captioning. We extend the labeled image samples five-fold, then augment, correct, and prune the vocabulary to approach a rough min-max (minimum words, maximum description). The result compares favorably with previous work that uses large pre-trained image models, while offering a hundred-fold reduction in model size without sacrificing overall accuracy (as measured by log entropy loss). These smaller models open new deployment opportunities, particularly on edge processors, on board satellites, or at distributed ground stations. To quantify a caption's descriptiveness, we introduce a novel multi-class confusion (error) matrix that scores both human-labeled test data and never-labeled images that have bounding-box detections but lack full-sentence captions. This work points to future captioning strategies, particularly ones that enrich class coverage beyond land-use applications and rely less on color-centered and adjacency adjectives ("green", "near", "between", etc.). Many modern language transformers offer novel and exploitable models with world knowledge gleaned from training on vast online corpora. One simple but interesting example: such a model might learn the word association between wind and waves, thus enriching a beach scene with more than the color descriptions that could otherwise be obtained from raw pixels without text annotation.
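The abstract does not say how the multi-class confusion matrix is built. As a minimal sketch, assume each image carries a set of object classes from its bounding-box labels, and that a caption "describes" a class when one of a hand-picked set of keywords appears in it. Per-class true positive, false positive, false negative, and true negative counts can then be tallied across images. The class names, keyword lists, and the functions `mentioned_classes`, `caption_confusion`, and `recall` are hypothetical illustrations, not the paper's method.

```python
import re
from collections import Counter

# Hypothetical class vocabulary (not from the paper): each class maps to the
# caption words that count as a mention of that class.
CLASS_KEYWORDS = {
    "airplane": {"airplane", "airplanes", "plane", "planes"},
    "ship": {"ship", "ships", "boat", "boats"},
    "storage_tank": {"tank", "tanks"},
    "vehicle": {"car", "cars", "vehicle", "vehicles"},
}


def mentioned_classes(caption: str) -> set:
    """Return the classes whose keywords appear as whole words in the caption."""
    words = set(re.findall(r"[a-z]+", caption.lower()))
    return {cls for cls, keys in CLASS_KEYWORDS.items() if words & keys}


def caption_confusion(samples):
    """Tally per-class TP/FP/FN/TN over (caption, detected_classes) pairs.

    detected_classes is the set of classes present in the image according to
    its bounding-box labels (an assumption of this sketch).
    """
    counts = {cls: Counter() for cls in CLASS_KEYWORDS}
    for caption, detected in samples:
        said = mentioned_classes(caption)
        for cls in CLASS_KEYWORDS:
            in_image, in_caption = cls in detected, cls in said
            if in_image and in_caption:
                counts[cls]["TP"] += 1  # object present and described
            elif in_caption:
                counts[cls]["FP"] += 1  # described but not detected
            elif in_image:
                counts[cls]["FN"] += 1  # detected but caption omits it
            else:
                counts[cls]["TN"] += 1  # absent and not mentioned
    return counts


def recall(c: Counter) -> float:
    """Fraction of images containing a class whose caption mentions it."""
    denom = c["TP"] + c["FN"]
    return c["TP"] / denom if denom else float("nan")


# Toy example: the second caption uses only color and adjacency words, so the
# detected vehicle goes unmentioned and counts as a false negative.
samples = [
    ("many planes are parked near the terminal", {"airplane", "vehicle"}),
    ("a green field between two roads", {"vehicle"}),
    ("several boats moored in the harbor", {"ship"}),
]
counts = caption_confusion(samples)
```

Keyword matching is a deliberately crude stand-in for the paper's scoring; it makes the cost of color-centered captions visible (low per-class recall) without needing a trained model.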
