Title
TCIM: Triangle Counting Acceleration With Processing-In-MRAM Architecture
Authors
Abstract
Triangle counting (TC) is a fundamental problem in graph analysis and has found numerous applications, which motivates many TC acceleration solutions on traditional computing platforms such as GPUs and FPGAs. However, these approaches suffer from a bandwidth bottleneck because TC calculation involves a large amount of data transfer. In this paper, we propose to overcome this challenge by designing a TC accelerator utilizing the emerging processing-in-MRAM (PIM) architecture. The true innovation behind our approach is a novel method to perform TC with bitwise logic operations (such as \texttt{AND}), instead of traditional approaches such as matrix computations. This enables an efficient in-memory implementation of TC computation, which we demonstrate in this paper with computational Spin-Transfer Torque Magnetic RAM (STT-MRAM) arrays. Furthermore, we develop customized graph slicing and mapping techniques to speed up the computation and reduce energy consumption. We use a device-to-architecture co-simulation framework to validate our proposed TC accelerator. The results show that our data mapping strategy can reduce the computation by $99.99\%$ and the memory \texttt{WRITE} operations by $72\%$. Compared with existing GPU and FPGA accelerators, our in-memory accelerator achieves speedups of $9\times$ and $23.4\times$, respectively, and a $20.6\times$ energy-efficiency improvement over the FPGA accelerator.
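To make the core idea concrete, the following is a minimal software sketch of triangle counting via bitwise \texttt{AND} on adjacency bitmaps, the operation the paper maps into computational STT-MRAM arrays. The graph representation and function names here are illustrative assumptions, not the paper's actual hardware mapping; the in-memory accelerator would perform the row-wise \texttt{AND} and population count inside the memory array rather than on a CPU.

```python
def count_triangles(edges, n):
    """Count triangles by AND-ing adjacency-row bitmaps per edge.

    Each vertex's neighbor set is packed into a Python int used as a
    bitmap; bit j of rows[i] is set iff edge (i, j) exists.
    """
    rows = [0] * n
    for u, v in edges:
        rows[u] |= 1 << v
        rows[v] |= 1 << u

    # For each edge (u, v), the common neighbors of u and v are exactly
    # the set bits of rows[u] & rows[v]; each such vertex closes one
    # triangle. Every triangle has 3 edges, so it is counted 3 times.
    total = 0
    for u, v in edges:
        total += bin(rows[u] & rows[v]).count("1")
    return total // 3

# Example: the complete graph K4 contains 4 triangles.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(edges, 4))  # → 4
```

The key property exploited by the PIM design is that the inner loop is nothing but bitwise \texttt{AND} followed by a population count, both of which map naturally onto in-memory bitwise logic, avoiding the data movement that a matrix-multiplication formulation would require.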