Paper Title

Hardware-Efficient Template-Based Deep CNNs Accelerator Design

Authors

Azzam Alhussain, Mingjie Lin

Abstract

Acceleration of Convolutional Neural Networks (CNNs) on edge devices has recently achieved remarkable performance in image classification and object detection applications. This paper proposes an efficient and scalable CNN-based SoC-FPGA accelerator design that takes pre-trained weights with 16-bit fixed-point quantization and a target hardware specification to generate an optimized template capable of achieving a better performance-versus-resource-utilization trade-off. The template analyzes the computational workload, data dependencies, and external memory bandwidth, and applies loop tiling transformation together with dataflow modeling to convert convolutional and fully connected layers into vector multiplications between input and output feature maps, yielding a single on-chip compute unit. Furthermore, the accelerator was evaluated on the AlexNet, VGG16, and LeNet networks and runs at 200 MHz with a peak performance of 230 GOP/s, depending on the ZYNQ board and on state-space exploration of different compute-unit configurations during simulation and synthesis. Lastly, our proposed methodology was benchmarked against previous work on the Ultra96 for higher performance measurement.
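The two core transformations named in the abstract, 16-bit fixed-point quantization of pre-trained weights and loop tiling that reduces a convolution to vector multiplications between feature-map patches and weight vectors, can be illustrated with a minimal NumPy sketch. This is a hypothetical illustration only: the tile sizes (`To`, `Ty`, `Tx`), the Q8.8 fixed-point format, and all function names are assumptions for demonstration, not the paper's actual HLS template.

```python
import numpy as np

def quantize_q16(w, frac_bits=8):
    """Hypothetical 16-bit fixed-point (Q8.8) quantization of weights:
    round to the nearest representable value, clamp to the int16 range."""
    scale = 1 << frac_bits
    q = np.clip(np.round(w * scale), -32768, 32767)
    return q / scale  # de-quantized values, on the fixed-point grid

def conv_direct(ifm, weights):
    """Reference direct convolution (stride 1, no padding).
    ifm: (C_in, H, W); weights: (C_out, C_in, K, K)."""
    C_out, C_in, K, _ = weights.shape
    H_out, W_out = ifm.shape[1] - K + 1, ifm.shape[2] - K + 1
    ofm = np.zeros((C_out, H_out, W_out))
    for co in range(C_out):
        for y in range(H_out):
            for x in range(W_out):
                ofm[co, y, x] = np.sum(ifm[:, y:y+K, x:x+K] * weights[co])
    return ofm

def conv_tiled(ifm, weights, To=2, Ty=4, Tx=4):
    """Same computation with tiled loops: the output feature map is
    partitioned into (To, Ty, Tx) tiles so each tile's working set fits
    in on-chip buffers, and the innermost step is a vector multiply
    between a flattened input patch and a flattened weight vector."""
    C_out, C_in, K, _ = weights.shape
    H_out, W_out = ifm.shape[1] - K + 1, ifm.shape[2] - K + 1
    ofm = np.zeros((C_out, H_out, W_out))
    for co0 in range(0, C_out, To):          # tile over output channels
        for y0 in range(0, H_out, Ty):       # tile over output rows
            for x0 in range(0, W_out, Tx):   # tile over output columns
                for co in range(co0, min(co0 + To, C_out)):
                    w_vec = weights[co].ravel()
                    for y in range(y0, min(y0 + Ty, H_out)):
                        for x in range(x0, min(x0 + Tx, W_out)):
                            patch = ifm[:, y:y+K, x:x+K].ravel()
                            ofm[co, y, x] = patch @ w_vec  # vector multiply
    return ofm
```

Both functions compute identical results; tiling only reorders the loop iterations, which is what lets a hardware template stream one tile of data at a time through a single compute unit instead of holding whole feature maps on chip.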
