Paper Title

Accelerating Multi-Model Inference by Merging DNNs of Different Weights

Paper Authors

Joo Seong Jeong, Soojeong Kim, Gyeong-In Yu, Yunseong Lee, Byung-Gon Chun

Paper Abstract

Standardized DNN models that have been proven to perform well on machine learning tasks are widely used and often adopted as-is to solve downstream tasks, forming the transfer learning paradigm. However, when serving multiple instances of such DNN models from a cluster of GPU servers, existing techniques to improve GPU utilization, such as batching, are inapplicable because the models often do not share weights due to fine-tuning. We propose NetFuse, a technique for merging multiple DNN models that share the same architecture but have different weights and different inputs. NetFuse is made possible by replacing operations with more general counterparts that allow a set of weights to be associated with only a certain set of inputs. Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference time by up to 3.6x on an NVIDIA V100 GPU and up to 3.0x on a TITAN Xp GPU when merging 32 model instances, while using only a small additional amount of GPU memory.
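To make the core idea concrete: fine-tuned models cannot be batched naively because each input must be multiplied by its *own* model's weights, not a shared weight matrix. A minimal sketch of the replace-with-a-more-general-operation idea, using numpy and an illustrative single linear layer (the variable names and the batched-matmul choice are assumptions for exposition, not the paper's actual implementation):

```python
import numpy as np

# Illustrative setup: N fine-tuned instances of the same linear layer,
# each with its own weights and its own batch of inputs.
N, B, D_in, D_out = 4, 8, 16, 32  # models, per-model batch, dims
rng = np.random.default_rng(0)
weights = [rng.standard_normal((D_in, D_out)) for _ in range(N)]
inputs = [rng.standard_normal((B, D_in)) for _ in range(N)]

# Unmerged serving: one separate matmul (kernel launch) per model.
separate = [x @ w for x, w in zip(inputs, weights)]

# Merged serving: replace the per-model matmul with a more general
# batched matmul in which each weight set is associated with only
# its own slice of the stacked input -- one fused call for all N models.
W = np.stack(weights)  # (N, D_in, D_out)
X = np.stack(inputs)   # (N, B, D_in)
merged = np.einsum('nbi,nio->nbo', X, W)

# The merged operation reproduces every model's output exactly.
assert all(np.allclose(merged[i], separate[i]) for i in range(N))
```

The speedup comes from amortizing kernel-launch and memory-traffic overhead across models in one larger operation, which is why the gains in the abstract grow with the number of merged instances.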
