Paper Title

GraphACT: Accelerating GCN Training on CPU-FPGA Heterogeneous Platforms

Paper Authors

Hanqing Zeng, Viktor Prasanna

Abstract

Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art deep learning model for representation learning on graphs. It is challenging to accelerate training of GCNs, due to (1) substantial and irregular data communication to propagate information within the graph, and (2) intensive computation to propagate information along the neural network layers. To address these challenges, we design a novel accelerator for training GCNs on CPU-FPGA heterogeneous systems, by incorporating multiple algorithm-architecture co-optimizations. We first analyze the computation and communication characteristics of various GCN training algorithms, and select a subgraph-based algorithm that is well suited for hardware execution. To optimize the feature propagation within subgraphs, we propose a lightweight pre-processing step based on a graph theoretic approach. Such pre-processing performed on the CPU significantly reduces the memory access requirements and the computation to be performed on the FPGA. To accelerate the weight update in GCN layers, we propose a systolic array based design for efficient parallelization. We integrate the above optimizations into a complete hardware pipeline, and analyze its load-balance and resource utilization by accurate performance modeling. We evaluate our design on a Xilinx Alveo U200 board hosted by a 40-core Xeon server. On three large graphs, we achieve an order of magnitude training speedup with negligible accuracy loss, compared with state-of-the-art implementation on a multi-core platform.
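To make the two bottlenecks named in the abstract concrete, here is a minimal sketch of one GCN layer's forward pass: the sparse, irregular feature propagation through the graph (the adjacency multiply) and the dense weight transform along the network layers. The names `A_hat`, `X`, and `W` and the toy graph are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gcn_layer(A_hat, X, W):
    """One GCN layer: H = ReLU(A_hat @ X @ W).

    A_hat : (n, n) normalized adjacency matrix of the (sub)graph
    X     : (n, f_in) node feature matrix
    W     : (f_in, f_out) trainable weight matrix
    """
    H = A_hat @ X            # feature propagation: sparse, memory-bound
    H = H @ W                # weight transform: dense, compute-bound
    return np.maximum(H, 0)  # ReLU activation

# Toy example on a 4-node graph.
rng = np.random.default_rng(0)
n, f_in, f_out = 4, 3, 2
A_hat = np.eye(n) + np.ones((n, n)) / n  # toy normalized adjacency
X = rng.standard_normal((n, f_in))
W = rng.standard_normal((f_in, f_out))
print(gcn_layer(A_hat, X, W).shape)  # (4, 2)
```

In the paper's design, the first multiply is what the CPU-side graph-theoretic preprocessing targets (reducing redundant aggregation work before it reaches the FPGA), while the second multiply is the dense workload mapped onto the systolic array.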
