Paper Title

HiKonv: Maximizing the Throughput of Quantized Convolution With Novel Bit-wise Management and Computation

Authors

Yao Chen, Junhao Pan, Xinheng Liu, Jinjun Xiong, Deming Chen

Abstract

Quantization for CNNs has shown significant progress toward reducing the cost of computation and storage with low-bitwidth data representations. There are, however, no systematic studies on how an existing full-bitwidth processing unit, such as an ALU in CPUs or a DSP in FPGAs, can be better utilized to deliver significantly higher computation throughput for convolution under various quantized bitwidths. In this study, we propose HiKonv, a unified solution that maximizes the throughput of convolution on a given underlying processing unit with low-bitwidth quantized data inputs through novel bit-wise management and parallel computation. We establish a theoretical framework and performance models using a full-bitwidth multiplier for highly parallelized low-bitwidth convolution, and demonstrate new breakthroughs for high-performance computing in this critical domain. For example, a single 32-bit processing unit in a CPU can deliver 128 binarized convolution operations (multiplications and additions) or thirteen 4-bit convolution operations with a single multiplication instruction, and a single 27×18 multiplier in an FPGA DSP can deliver 60, 8, or 2 convolution operations with 1-, 4-, or 8-bit inputs in one clock cycle. We demonstrate the effectiveness of HiKonv on both CPUs and FPGAs. On a CPU, HiKonv outperforms the baseline implementation across 1- to 8-bit inputs, providing up to 7.6x and 1.4x performance improvements for 1-D convolution, and delivers 2.74x and 3.19x speedups over the baseline for 4-bit signed and unsigned data inputs in 2-D convolution. On an FPGA, the HiKonv solution enables a single DSP to process multiple convolutions with a shorter processing latency. For binarized inputs, each DSP with HiKonv is equivalent to up to 76.6 LUTs. Compared to the DAC-SDC 2020 champion model, HiKonv achieves a 2.37x throughput improvement and a 2.61x DSP efficiency improvement.
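The core idea the abstract describes, namely fitting several low-bitwidth multiply-accumulates into one full-bitwidth multiplication, can be illustrated with standard packed (polynomial) arithmetic. The sketch below is not the authors' exact scheme; it is a minimal illustration of the principle, assuming unsigned 4-bit operands and a hypothetical 12-bit guard slot chosen so that every partial sum of the convolution fits in its slot. Packing the input sequence and the filter taps into two wide integers and performing a single multiplication yields all outputs of the full 1-D convolution at once:

```python
# Sketch of bit-wise packing for low-bitwidth convolution (illustrative only,
# not HiKonv's exact bit layout). One wide multiply computes a full 1-D
# convolution of two unsigned 4-bit sequences.

SLOT_BITS = 12  # guard-band width per output slot; an assumption chosen so the
                # largest partial sum (here < 4096) cannot overflow into the
                # neighboring slot

def pack(values, slot_bits=SLOT_BITS):
    """Pack a list of small unsigned ints into one integer, one per slot."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * slot_bits)
    return word

def unpack(word, n, slot_bits=SLOT_BITS):
    """Extract n slot values from a packed integer."""
    mask = (1 << slot_bits) - 1
    return [(word >> (i * slot_bits)) & mask for i in range(n)]

def conv1d_packed(x, w, slot_bits=SLOT_BITS):
    """Full 1-D convolution of two unsigned sequences via ONE multiplication.

    Multiplying the packed words is equivalent to multiplying the polynomials
    sum(x[i] * t**i) and sum(w[j] * t**j) evaluated at t = 2**slot_bits, so
    each slot of the product holds one convolution output sum(x[i]*w[k-i]).
    """
    prod = pack(x, slot_bits) * pack(w, slot_bits)  # the single multiply
    return unpack(prod, len(x) + len(w) - 1, slot_bits)

x = [3, 15, 7, 9]  # 4-bit inputs
w = [5, 11]        # 4-bit weights
print(conv1d_packed(x, w))  # -> [15, 108, 200, 122, 99]
```

The guard bits are the trade-off: wider slots prevent carries from corrupting adjacent outputs but reduce how many operands fit into a fixed-width multiplier, which is why narrower quantized inputs (e.g., binarized) allow many more operations per multiplication.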
