ZynqNet: An FPGA-Accelerated Embedded Convolutional Neural Network

Paper title

ZynqNet: An FPGA-Accelerated Embedded Convolutional Neural Network

Author

Gschwend, David

Abstract

Image understanding is becoming a vital feature in ever more applications, ranging from medical diagnostics to autonomous vehicles. Many applications demand embedded solutions that integrate into existing systems with tight real-time and power constraints. Convolutional Neural Networks (CNNs) presently achieve record-breaking accuracies in all image understanding benchmarks, but have a very high computational complexity. Embedded CNNs thus call for small and efficient, yet very powerful computing platforms. This master's thesis explores the potential of FPGA-based CNN acceleration and demonstrates a fully functional proof-of-concept CNN implementation on a Zynq System-on-Chip. The ZynqNet Embedded CNN is designed for image classification on ImageNet and consists of ZynqNet CNN, an optimized and customized CNN topology, and the ZynqNet FPGA Accelerator, an FPGA-based architecture for its evaluation. ZynqNet CNN is a highly efficient CNN topology. Detailed analysis and optimization of prior topologies using the custom-designed Netscope CNN Analyzer have enabled a CNN with 84.5% top-5 accuracy at a computational complexity of only 530 million multiply-accumulate operations. The topology is highly regular and consists exclusively of convolutional layers, ReLU nonlinearities and one global pooling layer. The CNN fits ideally onto the FPGA accelerator. The ZynqNet FPGA Accelerator allows an efficient evaluation of ZynqNet CNN. It accelerates the full network based on a nested-loop algorithm which minimizes the number of arithmetic operations and memory accesses. The FPGA accelerator has been synthesized using High-Level Synthesis for the Xilinx Zynq XC-7Z045, and reaches a clock frequency of 200 MHz with a device utilization of 80% to 90%.
