Paper Title
Training with Multi-Layer Embeddings for Model Reduction
Paper Authors
Paper Abstract
Modern recommendation systems rely on real-valued embeddings of categorical features. Increasing the dimension of embedding vectors improves model accuracy but comes at a high cost in model size. We introduce a multi-layer embedding training (MLET) architecture that trains embeddings via a sequence of linear layers to obtain a superior trade-off between embedding accuracy and model size. Our approach is fundamentally based on the ability of factorized linear layers to produce embeddings superior to those of a single linear layer. We focus on the analysis and implementation of a two-layer scheme. Harnessing recent results on the dynamics of backpropagation in linear neural networks, we explain the ability to obtain superior multi-layer embeddings via their tendency to have lower effective rank. We show that substantial advantages are obtained in the regime where the width of the hidden layer is much larger than that of the final embedding (d). Crucially, at the conclusion of training, we convert the two-layer solution into a single-layer one; as a result, the inference-time model size scales as d. We prototype the MLET scheme within Facebook's PyTorch-based open-source Deep Learning Recommendation Model. We show that it allows reducing d by 4-8X, with a corresponding improvement in memory footprint, at a given model accuracy. The experiments are run on two publicly available click-through-rate prediction benchmarks (Criteo-Kaggle and Avazu). The runtime cost of MLET is 25% on average.
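To make the two-layer scheme concrete, here is a minimal PyTorch sketch of the idea described in the abstract: a categorical feature is embedded through a wide inner table followed by a linear projection down to dimension d, and after training the two matrices are folded into a single d-dimensional table so the inference-time footprint scales with d. The module and parameter names (MLETEmbedding, k_hidden, collapse) are illustrative assumptions, not the paper's released implementation.

```python
# Sketch only: assumes the two-layer factorized-embedding idea from the abstract.
import torch
import torch.nn as nn

class MLETEmbedding(nn.Module):
    """Trains an embedding through two factorized linear layers:
    a wide inner table of width k_hidden (> d) followed by a linear
    projection down to the final embedding dimension d."""

    def __init__(self, num_categories: int, d: int, k_hidden: int):
        super().__init__()
        assert k_hidden >= d, "the benefit appears when the hidden width exceeds d"
        self.inner = nn.Embedding(num_categories, k_hidden)  # first (wide) layer
        self.proj = nn.Linear(k_hidden, d, bias=False)        # second linear layer

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # During training, the effective embedding is the product of the two layers.
        return self.proj(self.inner(idx))

    def collapse(self) -> nn.Embedding:
        """After training, fold the two layers into one d-dimensional table,
        so inference stores only num_categories x d parameters."""
        with torch.no_grad():
            single = nn.Embedding(self.inner.num_embeddings, self.proj.out_features)
            single.weight.copy_(self.inner.weight @ self.proj.weight.t())
        return single
```

At inference time, the collapsed nn.Embedding returned by collapse() would simply replace the two-layer module, so the deployed model is indistinguishable in size from one trained directly with dimension d.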