Paper Title

Sample-Efficient Optimisation with Probabilistic Transformer Surrogates

Paper Authors

Alexandre Maraval, Matthieu Zimmer, Antoine Grosnit, Rasul Tutunov, Jun Wang, Haitham Bou Ammar

Paper Abstract

Faced with problems of increasing complexity, recent research in Bayesian Optimisation (BO) has focused on adapting deep probabilistic models as flexible alternatives to Gaussian Processes (GPs). In a similar vein, this paper investigates the feasibility of employing state-of-the-art probabilistic transformers in BO. Upon further investigation, we observe two drawbacks stemming from their training procedure and loss definition, hindering their direct deployment as proxies in black-box optimisation. First, we notice that these models are trained on uniformly distributed inputs, which impairs predictive accuracy on non-uniform data - a setting that arises in any typical BO loop due to exploration-exploitation trade-offs. Second, we realise that training losses (e.g., cross-entropy) only asymptotically guarantee accurate posterior approximations, i.e., after arriving at the global optimum, which generally cannot be ensured. At the stationary points of the loss function, however, we observe a degradation in predictive performance, especially in exploratory regions of the input space. To tackle these shortcomings, we introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser that trades off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance. In a large panel of experiments, we demonstrate, for the first time, that a single transformer pre-trained on data sampled from random GP priors produces competitive results on 16 benchmark black-boxes compared to GP-based BO. Since our model is pre-trained only once and used across all tasks without any retraining and/or fine-tuning, we report an order-of-magnitude time reduction while matching and sometimes outperforming GPs.
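
To make the pre-training setup described in the abstract concrete, below is a minimal sketch, assuming a 1-D search space and an RBF kernel, of how one might generate a single pre-training task: a function drawn from a random GP prior, evaluated at non-uniformly distributed inputs that mimic the clustered designs a BO loop produces. The function names, the exploration/exploitation split, and all hyper-parameters are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of BO-tailored pre-training data generation (illustrative assumptions,
# not the authors' exact prior): functions are sampled from random GP priors,
# and inputs are drawn non-uniformly to imitate a BO-generated design.
import numpy as np

def rbf_kernel(x, y, lengthscale):
    """Squared-exponential kernel matrix between 1-D input arrays x and y."""
    d = x[:, None] - y[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_bo_like_inputs(rng, n_points):
    """Non-uniform design: a few uniform 'exploration' points plus a cluster
    of 'exploitation' points around a random incumbent (an assumed way to
    mimic the data a BO loop produces)."""
    n_explore = n_points // 3
    explore = rng.uniform(0.0, 1.0, size=n_explore)
    incumbent = rng.uniform(0.0, 1.0)
    exploit = np.clip(
        incumbent + 0.05 * rng.standard_normal(n_points - n_explore), 0.0, 1.0
    )
    return np.sort(np.concatenate([explore, exploit]))

def sample_gp_prior_task(rng, n_points=64):
    """One pre-training task: non-uniform inputs and function values drawn
    from a GP prior with a randomly sampled lengthscale."""
    x = sample_bo_like_inputs(rng, n_points)
    lengthscale = rng.uniform(0.05, 0.5)
    K = rbf_kernel(x, x, lengthscale) + 1e-6 * np.eye(n_points)  # jitter
    y = rng.multivariate_normal(np.zeros(n_points), K)
    return x, y

rng = np.random.default_rng(0)
tasks = [sample_gp_prior_task(rng) for _ in range(4)]  # a tiny pre-training batch
```

The transformer would then be pre-trained once on many such tasks under a cross-entropy loss, augmented with the paper's approximate posterior regulariser to penalise excessive input sensitivity; the exact form of that regulariser is given in the paper, not here.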
