Paper Title

Sample-Efficient Optimisation with Probabilistic Transformer Surrogates

Paper Authors

Alexandre Maraval, Matthieu Zimmer, Antoine Grosnit, Rasul Tutunov, Jun Wang, Haitham Bou Ammar

Paper Abstract

Faced with problems of increasing complexity, recent research in Bayesian Optimisation (BO) has focused on adapting deep probabilistic models as flexible alternatives to Gaussian Processes (GPs). In a similar vein, this paper investigates the feasibility of employing state-of-the-art probabilistic transformers in BO. Upon further investigation, we observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation. First, we notice that these models are trained on uniformly distributed inputs, which impairs predictive accuracy on non-uniform data - a setting that arises in any typical BO loop due to exploration-exploitation trade-offs. Second, we realise that training losses (e.g., cross-entropy) only asymptotically guarantee accurate posterior approximations, i.e., after arriving at the global optimum, which generally cannot be ensured. At the stationary points of the loss function, however, we observe a degradation in predictive performance, especially in exploratory regions of the input space. To tackle these shortcomings, we introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser that trades off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance. In a large panel of experiments, we demonstrate, for the first time, that a single transformer pre-trained on data sampled from random GP priors produces competitive results on 16 benchmark black-boxes compared to GP-based BO. Since our model is pre-trained only once and used across all tasks without any retraining and/or fine-tuning, we report an order-of-magnitude time reduction while matching and sometimes outperforming GPs.
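
To make the pre-training setup described in the abstract concrete, below is a minimal sketch, assuming a 1-D search space and an RBF kernel, of how one might generate a single pre-training task: a function drawn from a random GP prior, evaluated at non-uniformly distributed inputs that mimic the clustered designs a BO loop produces. The function names, the exploration/exploitation split, and all hyper-parameters are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of BO-tailored pre-training data generation (illustrative assumptions,
# not the authors' exact prior): functions are sampled from random GP priors,
# and inputs are drawn non-uniformly to imitate a BO-generated design.
import numpy as np

def rbf_kernel(x, y, lengthscale):
    """Squared-exponential kernel matrix between 1-D input arrays x and y."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_bo_like_inputs(rng, n_points):
    """Non-uniform design: a few uniform 'exploration' points plus a cluster
    of 'exploitation' points around a random incumbent (an assumed way to
    mimic the data a BO loop produces)."""
    n_explore = n_points // 3
    explore = rng.uniform(0.0, 1.0, size=n_explore)
    incumbent = rng.uniform(0.0, 1.0)
    exploit = np.clip(
        incumbent + 0.05 * rng.standard_normal(n_points - n_explore), 0.0, 1.0
    )
    return np.sort(np.concatenate([explore, exploit]))

def sample_gp_prior_task(rng, n_points=64):
    """One pre-training task: non-uniform inputs and function values drawn
    from a GP prior with a randomly sampled lengthscale."""
    x = sample_bo_like_inputs(rng, n_points)
    lengthscale = rng.uniform(0.05, 0.5)
    K = rbf_kernel(x, x, lengthscale) + 1e-6 * np.eye(n_points)  # jitter
    y = rng.multivariate_normal(np.zeros(n_points), K)
    return x, y

rng = np.random.default_rng(0)
tasks = [sample_gp_prior_task(rng) for _ in range(4)]  # a tiny pre-training batch
```

The transformer would then be pre-trained once on many such tasks under a cross-entropy loss, augmented with the paper's approximate posterior regulariser to penalise excessive input sensitivity; the exact form of that regulariser is given in the paper, not here.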
