Paper Title
BP-Im2col: Implicit Im2col Supporting AI Backpropagation on Systolic Arrays
Paper Authors
Paper Abstract
State-of-the-art systolic array-based accelerators adopt the traditional im2col algorithm to accelerate the inference of convolutional layers. However, traditional im2col cannot efficiently support AI backpropagation. Backpropagation in convolutional layers involves performing transposed convolution and dilated convolution, which usually introduce a large number of zero-spaces into the feature maps or kernels. This zero-space data reorganization interferes with the continuity of training and incurs additional, non-negligible overhead in terms of off- and on-chip storage, access, and performance. Since countermeasures for backpropagation are rarely proposed, we propose BP-im2col, a novel im2col algorithm for AI backpropagation, and implement it in RTL on a TPU-like accelerator. Experiments on a TPU-like accelerator indicate that BP-im2col reduces the backpropagation runtime by 34.9% on average, and reduces the bandwidth of off-chip memory and on-chip buffers by at least 22.7% and 70.6% respectively, over a baseline accelerator adopting the traditional im2col. It further reduces the additional storage overhead in the backpropagation process by at least 74.78%.
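To make the zero-space issue mentioned in the abstract concrete, the following is a minimal NumPy sketch, not the paper's BP-im2col algorithm or its RTL implementation. It shows that the input-gradient pass of a strided convolution is equivalent to a stride-1 convolution of a zero-dilated, zero-padded output gradient with the flipped kernel, which is exactly the inflated data a traditional im2col would have to stage. The helper names `conv1d` and `grad_input_explicit`, and all sizes, are illustrative assumptions.

```python
# Illustrative sketch only: shows why backpropagation through a strided conv layer
# forces traditional im2col to materialize zero-spaces (zero-dilated/padded gradients).
import numpy as np

def conv1d(x, w, stride=1):
    """Plain 1D cross-correlation with 'valid' padding (illustrative helper)."""
    k = len(w)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(out_len)])

def grad_input_explicit(dy, w, stride, in_len):
    """dL/dX via explicit zero-insertion: the data traditional im2col would stage."""
    k = len(w)
    # Transposed-conv view: insert (stride - 1) zeros between gradient elements.
    dilated = np.zeros((len(dy) - 1) * stride + 1)
    dilated[::stride] = dy
    # Pad with (k - 1) zeros on each side, then correlate with the flipped kernel.
    padded = np.pad(dilated, k - 1)
    dx = conv1d(padded, w[::-1], stride=1)
    assert len(dx) == in_len
    # Fraction of staged values that are inserted zeros (wasted storage/bandwidth).
    zero_ratio = 1.0 - np.count_nonzero(padded) / padded.size
    return dx, zero_ratio

x = np.arange(7, dtype=float)        # toy input
w = np.array([1.0, 2.0, -1.0])       # toy kernel
y = conv1d(x, w, stride=2)           # forward pass: 3 outputs
dy = np.ones_like(y)                 # upstream gradient dL/dY

dx, zero_ratio = grad_input_explicit(dy, w, stride=2, in_len=len(x))
print(dx)                  # gradient w.r.t. the input
print(f"{zero_ratio:.0%}") # ~67% of the staged buffer is zeros in this toy case
```

In this toy configuration, two thirds of the values that an explicit im2col lowering would read and store are inserted zeros; an implicit scheme in the spirit of BP-im2col instead generates addresses that skip them.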