Geometrical Homogeneous Clustering for Image Data Reduction

Paper Title

Geometrical Homogeneous Clustering for Image Data Reduction

Authors

Mody, Shril; Thakkar, Janvi; Joshi, Devvrat; Soni, Siddharth; Patil, Rohan; Batra, Nipun

Abstract

In this paper, we present novel variations of an earlier approach, the homogeneous clustering algorithm, for reducing dataset size. The intuition behind the proposed approaches is to partition the dataset into homogeneous clusters and select some images that contribute significantly to accuracy. The selected images form a proper subset of the training data and are therefore human-readable. We propose four variations on the baseline algorithm, RHC. The intuition behind the first approach, RHCKON, is that boundary points contribute significantly to the representation of clusters; it selects the k farthest neighbours and the one nearest neighbour of each cluster's centroid. In the next two approaches, KONCW and CWKC, we introduce the concept of cluster weights, based on the fact that larger clusters contribute more than smaller ones. The final variation, GHCIDR, selects points based on the geometrical aspects of the data distribution. We ran experiments with the four variants on two deep learning models, Fully Connected Networks (FCN) and VGG1, and on three datasets: MNIST, CIFAR10, and Fashion-MNIST. GHCIDR gave the best accuracy of 99.35%, 81.10%, and 91.66%, with training data reductions of 87.27%, 32.34%, and 76.80%, on MNIST, CIFAR10, and Fashion-MNIST respectively.
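The selection rule described for RHCKON (partition into single-class clusters, then keep the k farthest points and the one nearest point to each centroid) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the splitting procedure, so the sketch assumes a k-means split seeded with per-class means and Euclidean distance on flattened feature vectors. The function names `homogeneous_clusters`, `select_k_farthest_one_nearest`, and `reduce_dataset` are hypothetical. Cluster weights (KONCW, CWKC) and the geometric selection of GHCIDR are not covered.

```python
import numpy as np


def homogeneous_clusters(X, y, max_iter=20):
    """Split (X, y) until every cluster holds a single class.

    Assumption: non-homogeneous clusters are split with k-means seeded by
    the mean of each class present; the abstract does not give this detail.
    Returns a list of index arrays into X.
    """
    done, queue = [], [np.arange(len(X))]
    while queue:
        idx = queue.pop()
        classes = np.unique(y[idx])
        if len(classes) == 1:
            done.append(idx)
            continue
        pts, labels = X[idx], y[idx]
        # One initial centroid per class present in this cluster.
        centroids = np.stack([pts[labels == c].mean(axis=0) for c in classes])
        for _ in range(max_iter):
            dists = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            updated = np.stack([
                pts[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
                for j in range(len(classes))
            ])
            if np.allclose(updated, centroids):
                break
            centroids = updated
        parts = [idx[assign == j] for j in range(len(classes)) if np.any(assign == j)]
        if len(parts) == 1:
            # k-means did not separate anything; fall back to splitting by class
            # so that the loop always makes progress.
            parts = [idx[labels == c] for c in classes]
        queue.extend(parts)
    return done


def select_k_farthest_one_nearest(X, idx, k):
    """Return dataset indices of the nearest point and the k farthest points
    from the centroid of the cluster given by idx (Euclidean distance assumed)."""
    centroid = X[idx].mean(axis=0)
    order = np.argsort(np.linalg.norm(X[idx] - centroid, axis=1))
    nearest = order[:1]
    # Skip the nearest point so it is never selected twice in small clusters.
    farthest = order[1:][::-1][:k]
    return idx[np.concatenate([nearest, farthest])]


def reduce_dataset(X, y, k):
    """Sorted indices of the retained samples; these are original images,
    so the reduced set is a subset of the training data."""
    kept = [select_k_farthest_one_nearest(X, c, k) for c in homogeneous_clusters(X, y)]
    return np.sort(np.concatenate(kept))
```

For image data, `X` would hold one flattened image per row and `y` the class labels; `X[reduce_dataset(X, y, k)]` then gives the reduced training set. Larger `k` keeps more boundary points per cluster and therefore yields a smaller reduction.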
