Paper Title

Leveraging GPU Tensor Cores for Double Precision Euclidean Distance Calculations

Paper Authors

Benoit Gallet, Michael Gowanlock

Paper Abstract

Tensor cores (TCs) are a type of Application-Specific Integrated Circuit (ASIC) and are a recent addition to Graphics Processing Unit (GPU) architectures. As such, TCs are purposefully designed to greatly improve the performance of Matrix Multiply-Accumulate (MMA) operations. While TCs are heavily studied for machine learning and closely related fields, where their high efficiency is undeniable, MMA operations are not unique to these fields. More generally, any computation that can be expressed as MMA operations can leverage TCs, and potentially benefit from their higher computational throughput compared to other general-purpose cores, such as CUDA cores on Nvidia GPUs. In this paper, we propose the first double precision (FP64) Euclidean distance calculation algorithm, which is expressed as MMA operations to leverage TCs on Nvidia GPUs, rather than the more commonly used CUDA cores. To show that the Euclidean distance can be accelerated in a real-world application, we evaluate our proposed TC algorithm on the distance similarity self-join problem, as the most computationally intensive part of the algorithm consists of computing distances in a multi-dimensional space. We find that the performance gain from using the tensor core algorithm over the CUDA core algorithm depends weakly on the dataset size and distribution, but is strongly dependent on data dimensionality. Overall, TCs are a compelling alternative to CUDA cores, particularly when the data dimensionality is low ($\leq{4}$), as we achieve an average speedup of $1.28\times$ and up to $2.23\times$ against a state-of-the-art GPU distance similarity self-join algorithm. Furthermore, because this paper is among the first to explore the use of TCs for FP64 general-purpose computation, future research is promising.
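The key idea behind expressing Euclidean distances as MMA operations is that pairwise squared distances decompose into per-point norm terms plus an inner-product term, and the inner products between all pairs of points form a matrix product that maps directly onto tensor-core hardware. A common formulation of this decomposition (a sketch of the general technique, not necessarily the exact formulation used in the paper) for points $a_i$ and $b_j$ stored as rows of matrices $A$ and $B$ is:

$$ d(a_i, b_j)^2 = \|a_i\|^2 + \|b_j\|^2 - 2\, a_i \cdot b_j, \qquad D^{(2)} = \operatorname{diag}(AA^{\mathsf{T}})\,\mathbf{1}^{\mathsf{T}} + \mathbf{1}\,\operatorname{diag}(BB^{\mathsf{T}})^{\mathsf{T}} - 2\,AB^{\mathsf{T}}, $$

where $D^{(2)}$ holds all pairwise squared distances, the $AB^{\mathsf{T}}$ term is the matrix multiply that can be executed in FP64 on tensor cores (supported on Ampere-class Nvidia GPUs), and the norm terms are inexpensive corrections added afterward. For the self-join problem evaluated in the paper, $A = B$, so only one set of norms needs to be computed.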
