Paper Title

Intelligence Processing Units Accelerate Neuromorphic Learning

Authors

Pao-Sheng Vincent Sun, Alexander Titterton, Anjlee Gopiani, Tim Santos, Arindam Basu, Wei D. Lu, Jason K. Eshraghian

Abstract

Spiking neural networks (SNNs) have achieved orders of magnitude improvement in terms of energy consumption and latency when performing inference with deep learning workloads. Error backpropagation is presently regarded as the most effective method for training SNNs, but in a twist of irony, when training on modern graphics processing units (GPUs) this becomes more expensive than non-spiking networks. The emergence of Graphcore's Intelligence Processing Units (IPUs) balances the parallelized nature of deep learning workloads with the sequential, reusable, and sparsified nature of operations prevalent when training SNNs. IPUs adopt multi-instruction multi-data (MIMD) parallelism by running individual processing threads on smaller data blocks, which is a natural fit for the sequential, non-vectorized steps required to solve spiking neuron dynamical state equations. We present an IPU-optimized release of our custom SNN Python package, snnTorch, which exploits fine-grained parallelism by utilizing low-level, pre-compiled custom operations to accelerate irregular and sparse data access patterns that are characteristic of training SNN workloads. We provide a rigorous performance assessment across a suite of commonly used spiking neuron models, and propose methods to further reduce training run-time via half-precision training. By amortizing the cost of sequential processing into vectorizable population codes, we ultimately demonstrate the potential for integrating domain-specific accelerators with the next generation of neural networks.
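
The "sequential, non-vectorized steps" the abstract refers to can be illustrated with a minimal snnTorch sketch: each timestep's membrane potential depends on the previous one, so the time loop cannot be collapsed into a single batched matrix operation. The layer sizes, decay factor, and input shapes below are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import snntorch as snn

# Minimal sketch (assumed sizes/values, not from the paper) of the
# per-timestep state update that makes SNN training sequential:
# the membrane potential at step t depends on step t-1.
num_steps, batch, n_in, n_hidden = 25, 128, 784, 100  # illustrative
fc = nn.Linear(n_in, n_hidden)        # synaptic weights
lif = snn.Leaky(beta=0.9)             # leaky integrate-and-fire neuron

mem = lif.init_leaky()                # initialize membrane potential
x = torch.rand(num_steps, batch, n_in)  # dummy input currents

spk_rec = []
for step in range(num_steps):         # this loop resists vectorization:
    cur = fc(x[step])                 # project input at timestep t
    spk, mem = lif(cur, mem)          # spike output + updated membrane state
    spk_rec.append(spk)

spk_out = torch.stack(spk_rec)        # [num_steps, batch, n_hidden]
```

In plain PyTorch, the half-precision training the abstract mentions would roughly correspond to casting the model and inputs to torch.float16; how the paper's IPU pipeline applies reduced precision is not detailed here, so the above is only an eager-mode sketch.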
