Paper Title

Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine

Paper Authors

Prasanth Chatarasi, Stephen Neuendorffer, Samuel Bayliss, Kees Vissers, Vivek Sarkar

Paper Abstract

Xilinx's AI Engine is a recent industry example of energy-efficient vector processing that includes novel support for 2D SIMD datapaths and shuffle interconnection network. The current approach to programming the AI Engine relies on a C/C++ API for vector intrinsics. While an advance over assembly-level programming, it requires the programmer to specify a number of low-level operations based on detailed knowledge of the hardware. To address these challenges, we introduce Vyasa, a new programming system that extends the Halide DSL compiler to automatically generate code for the AI Engine. We evaluated Vyasa on 36 CONV2D and 6 CONV3D workloads, and achieved geometric means of 7.6 and 23.3 MACs/cycle for 32-bit and 16-bit operands (which represent 95.9% and 72.8% of the peak performance respectively). For 4 of these workloads for which expert-written codes were available to us, Vyasa demonstrated a geometric mean performance improvement of 1.10x with 50x smaller code relative to the expert-written codes.
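
For context, the input Vyasa consumes is a high-level Halide definition of the convolution rather than C/C++ with vector intrinsics. The sketch below shows a 3x3 CONV2D written against the stock Halide C++ API; the filter size, 16-bit operand types, and all names are illustrative assumptions, and the AI Engine code-generation path described in the paper is not part of the upstream Halide distribution.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Single-channel 2D convolution expressed as a pure Halide pipeline.
    // Operand widths follow the paper's 16-bit evaluation setting; the
    // 3x3 window and all names here are illustrative, not from the paper.
    ImageParam input(Int(16), 2, "input");
    ImageParam weights(Int(16), 2, "weights");

    Var x("x"), y("y");
    RDom r(0, 3, 0, 3);  // 3x3 reduction window

    Func conv("conv2d");
    conv(x, y) = cast<int32_t>(0);
    conv(x, y) += cast<int32_t>(input(x + r.x, y + r.y)) *
                  cast<int32_t>(weights(r.x, r.y));

    // Ahead-of-time compilation for the host target as a stand-in;
    // targeting the AI Engine requires the backend described in the paper.
    conv.compile_to_static_library("conv2d", {input, weights}, "conv2d");
    return 0;
}
```

Note that no manual vectorize() or unroll() schedule appears above; the paper's claim is precisely that Vyasa derives the AI Engine vectorization (use of the 2D SIMD datapath and shuffle-network data movement) from such a high-level definition automatically.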
