Paper title
Complex-valued embeddings of generic proximity data
Paper authors
Paper abstract
Proximities are at the heart of almost all machine learning methods. If the input data are given as numerical vectors of equal length, the Euclidean distance or a Hilbertian inner product is frequently used in modeling algorithms. In a more generic view, objects are compared by a (symmetric) similarity or dissimilarity measure, which may not obey particular mathematical properties. This renders many machine learning methods invalid, leading to convergence problems and the loss of guarantees such as generalization bounds. In many cases, the preferred dissimilarity measure is not metric, like the earth mover's distance, or the similarity measure is not a simple inner product in a Hilbert space but an inner product in its generalization, a Krein space. If the input data are non-vectorial, like text sequences, proximity-based learning is used or n-gram embedding techniques can be applied. Standard embeddings lead to the desired fixed-length vector encoding, but they are costly and have substantial limitations in preserving the full information of the original data. As an information-preserving alternative, we propose a complex-valued vector embedding of proximity data. This allows suitable machine learning algorithms to use these fixed-length, complex-valued vectors for further processing. The complex-valued data can serve as input to complex-valued machine learning algorithms. In particular, we address supervised learning and use extensions of prototype-based learning. The proposed approach is evaluated on a variety of standard benchmarks and shows strong performance compared to traditional techniques for processing non-metric or non-PSD proximity data.
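
The abstract does not spell out how the complex-valued embedding is constructed. The following is a minimal sketch of one plausible construction, not the authors' confirmed method: it assumes the proximity data are given as a symmetric, possibly indefinite similarity matrix `S`, eigendecomposes it, and scales each eigenvector by the complex square root of its eigenvalue. Negative eigenvalues then yield purely imaginary coordinates, and `S` is recovered exactly from the plain (non-conjugate) product of the embedding with its transpose. The function name `complex_embedding`, the tolerance `tol`, and the toy matrix are illustrative assumptions.

```python
import numpy as np


def complex_embedding(S, tol=1e-10):
    """Embed a symmetric (possibly indefinite) similarity matrix.

    Hypothetical helper: returns X with one row per object such that
    X @ X.T (plain transpose, no conjugation) reproduces S.
    """
    S = 0.5 * (S + S.T)                    # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(S)   # real eigenpairs of a symmetric matrix
    keep = np.abs(eigvals) > tol           # drop numerically zero directions
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    # Complex square root: negative eigenvalues become imaginary scale factors.
    roots = np.sqrt(eigvals.astype(complex))
    return eigvecs * roots                 # scale each eigenvector column


# Toy indefinite similarity matrix with a known spectrum (3 positive, 2 negative).
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
S = Q @ np.diag([3.0, 2.0, 1.0, -1.0, -2.0]) @ Q.T

X = complex_embedding(S)
S_rec = X @ X.T                            # bilinear form, not Hermitian
print("max reconstruction error:", np.abs(S_rec - S).max())
```

Each row of `X` is a fixed-length complex vector for one object. In contrast to clipping or flipping the negative eigenvalues to force a positive semi-definite matrix, this sketch discards no part of the spectrum, which is one way to read the "information-preserving" claim in the abstract.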