Title

Accelerating Polynomial Multiplication for Homomorphic Encryption on GPUs

Authors

Kaustubh Shivdikar, Gilbert Jonatan, Evelio Mora, Neal Livesay, Rashmi Agrawal, Ajay Joshi, Jose Abellan, John Kim, David Kaeli

Abstract

Homomorphic Encryption (HE) enables users to securely outsource both the storage and computation of sensitive data to untrusted servers. Not only does HE offer an attractive solution for security in cloud systems, but lattice-based HE systems are also believed to be resistant to attacks by quantum computers. However, current HE implementations suffer from prohibitively high latency. For lattice-based HE to become viable for real-world systems, it is necessary for the key bottlenecks - particularly polynomial multiplication - to be highly efficient. In this paper, we present a characterization of GPU-based implementations of polynomial multiplication. We begin with a survey of modular reduction techniques and analyze several variants of the widely-used Barrett modular reduction algorithm. We then propose a modular reduction variant optimized for 64-bit integer words on the GPU, obtaining a 1.8x speedup over the existing comparable implementations. Next, we explore the following GPU-specific improvements for polynomial multiplication targeted at optimizing latency and throughput: 1) We present a 2D mixed-radix, multi-block implementation of NTT that results in a 1.85x average speedup over the previous state-of-the-art. 2) We explore shared memory optimizations aimed at reducing redundant memory accesses, further improving speedups by 1.2x. 3) Finally, we fuse the Hadamard product with neighboring stages of the NTT, reducing the twiddle factor memory footprint by 50%. By combining our NTT optimizations, we achieve an overall speedup of 123.13x and 2.37x over the previous state-of-the-art CPU and GPU implementations of NTT kernels, respectively.
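The abstract names two core primitives: Barrett modular reduction and NTT-based polynomial multiplication with a pointwise (Hadamard) product between the forward and inverse transforms. The following is a minimal CPU-side Python sketch of those primitives, not the paper's GPU implementation; the parameters (modulus q = 12289, prime p = 17, size n = 8, root w = 9) are illustrative choices, not values taken from the paper.

```python
# Minimal CPU-side sketch of the abstract's building blocks.
# NOT the paper's GPU kernels; parameters are illustrative only.

def barrett_reduce(x, q, m, k):
    """Compute x mod q for 0 <= x < q*q without a hardware divide.

    m = floor(2^k / q) is precomputed once per modulus; the quotient
    estimate (x*m) >> k undershoots floor(x/q) by at most a small
    constant, so a short correction loop finishes the reduction.
    """
    t = x - ((x * m) >> k) * q
    while t >= q:
        t -= q
    return t

def ntt(a, w, p):
    """Recursive radix-2 Cooley-Tukey NTT of a mod p.

    w must be a primitive len(a)-th root of unity mod p,
    and len(a) a power of two.
    """
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], w * w % p, p)
    odd = ntt(a[1::2], w * w % p, p)
    out, t = [0] * n, 1
    for i in range(n // 2):
        out[i] = (even[i] + t * odd[i]) % p
        out[i + n // 2] = (even[i] - t * odd[i]) % p
        t = t * w % p
    return out

def poly_mul(a, b, w, p):
    """Cyclic product a*b mod (x^n - 1, p) via NTT: forward NTT,
    pointwise (Hadamard) product, then inverse NTT scaled by n^-1."""
    n = len(a)
    A, B = ntt(a, w, p), ntt(b, w, p)
    had = [x * y % p for x, y in zip(A, B)]   # Hadamard product
    c = ntt(had, pow(w, p - 2, p), p)         # inverse transform
    n_inv = pow(n, p - 2, p)
    return [x * n_inv % p for x in c]

# Barrett example: q = 12289 (an NTT-friendly prime), k = 2 * bitlen(q).
q = 12289
k = 2 * q.bit_length()
m = (1 << k) // q
print(barrett_reduce(123456789, q, m, k) == 123456789 % q)  # True

# NTT example: w = 9 is a primitive 8th root of unity mod 17,
# and (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2.
print(poly_mul([1, 2, 0, 0, 0, 0, 0, 0],
               [3, 4, 0, 0, 0, 0, 0, 0], 9, 17))
# → [3, 10, 8, 0, 0, 0, 0, 0]
```

The sketch makes the abstract's fusion opportunity visible: the Hadamard product is a single elementwise pass sandwiched between the last forward-NTT stage and the first inverse-NTT stage, which is why it can be merged into the neighboring NTT kernels.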
