An Entropy Clustering Approach for Assessing Visual Question Difficulty

Paper title

An Entropy Clustering Approach for Assessing Visual Question Difficulty

Authors

Terao, Kento; Tamaki, Toru; Raytchev, Bisser; Kaneda, Kazufumi; Satoh, Shun'ichi

Abstract

We propose a novel approach for identifying the difficulty of visual questions in Visual Question Answering (VQA) without direct supervision or difficulty annotations. Prior work has considered the diversity of the ground-truth answers given by human annotators. In contrast, we analyze the difficulty of visual questions based on the behavior of multiple different VQA models. We propose to cluster the entropy values of the predicted answer distributions obtained by three different models: a baseline method that takes both images and questions as input, and two variants that take only images or only questions as input. We use simple k-means to cluster the visual questions of the VQA v2 validation set. We then use state-of-the-art methods to determine the accuracy and the entropy of the answer distributions for each cluster. A benefit of the proposed method is that no annotation of difficulty is required, because the accuracy of each cluster reflects the difficulty of the visual questions that belong to it. Our approach can identify clusters of difficult visual questions that state-of-the-art methods fail to answer correctly. Detailed analysis on the VQA v2 dataset reveals that 1) all methods perform poorly on the most difficult cluster (about 10% accuracy), 2) as cluster difficulty increases, the answers predicted by the different methods begin to differ, and 3) cluster entropy is highly correlated with cluster accuracy. We show that our approach can also assess the difficulty of visual questions that have no ground truth (i.e., the test set of VQA v2) by assigning them to one of the clusters. We expect that this can stimulate new directions of research and the development of new algorithms.
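The core idea in the abstract, clustering per-question entropy vectors with k-means, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy answer distributions, the choice of k = 2, the use of natural-log entropy, and the helper names entropy, assign and kmeans are hypothetical and are not taken from the paper. The sketch also ranks clusters by mean entropy as a stand-in, whereas the paper measures cluster difficulty through the accuracy of state-of-the-art methods.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def assign(points, centers):
    """Index of the nearest center (Euclidean distance) for each point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct points
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Coordinate-wise mean of the points assigned to cluster c.
                new_centers.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return assign(points, centers), centers


# Toy data (assumed): for each question, the answer distributions predicted by
# an image+question model, a question-only model and an image-only model.
toy_predictions = [
    ([0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02], [0.85, 0.05, 0.05, 0.05]),
    ([0.94, 0.02, 0.02, 0.02], [0.88, 0.06, 0.03, 0.03], [0.80, 0.10, 0.05, 0.05]),
    ([0.25, 0.25, 0.25, 0.25], [0.30, 0.30, 0.20, 0.20], [0.25, 0.25, 0.25, 0.25]),
    ([0.30, 0.25, 0.25, 0.20], [0.25, 0.25, 0.25, 0.25], [0.40, 0.20, 0.20, 0.20]),
]

# One three-dimensional entropy vector per question.
features = [tuple(entropy(d) for d in triple) for triple in toy_predictions]
labels, centers = kmeans(features, k=2)

# Proxy only: treat the cluster with the largest total entropy as the hardest.
hardest = max(range(len(centers)), key=lambda c: sum(centers[c]))
print(labels, hardest)
```

In this toy setting the two confident questions and the two near-uniform questions fall into separate clusters. A question without a ground-truth answer could be given a difficulty estimate the same way, by computing its entropy vector and calling assign against the fitted centers.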
