Paper Title
Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing
Paper Authors
Paper Abstract
Despite their strong performance on many tasks, pre-trained language models have been shown to struggle on out-of-distribution compositional generalization. Meanwhile, recent work has shown considerable improvements on many NLP tasks from model scaling. Can scaling up model size also improve compositional generalization in semantic parsing? We evaluate encoder-decoder models up to 11B parameters and decoder-only models up to 540B parameters, and compare model scaling curves for three different methods for applying a pre-trained language model to a new task: fine-tuning all parameters, prompt tuning, and in-context learning. We observe that fine-tuning generally has flat or negative scaling curves on out-of-distribution compositional generalization in semantic parsing evaluations. In-context learning has positive scaling curves, but is generally outperformed by much smaller fine-tuned models. Prompt tuning can outperform fine-tuning, and it exhibits a more positive scaling curve, suggesting further potential improvements from scaling. Additionally, we identify several error trends that vary with model scale. For example, larger models are generally better at modeling the syntax of the output space, but are also more prone to certain types of overfitting. Overall, our study highlights limitations of current techniques for effectively leveraging model scale for compositional generalization, while our analysis also suggests promising directions for future work.
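To make the contrast between the three adaptation methods concrete, below is a minimal sketch of prompt tuning: the pre-trained backbone is frozen and only a small set of "soft prompt" vectors prepended to the input embeddings is trained. This is an illustrative sketch, not the paper's implementation; it assumes a Hugging Face-style encoder-decoder model exposing `get_input_embeddings()` and accepting `inputs_embeds` and `labels`, and the prompt length and initialization scale are arbitrary choices.

```python
# Minimal prompt-tuning sketch (assumption: a transformers-style seq2seq model,
# e.g. T5, is passed in; prompt_length=20 and init scale 0.02 are illustrative).
import torch
import torch.nn as nn


class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to the (frozen) input embeddings."""

    def __init__(self, prompt_length: int, embed_dim: int):
        super().__init__()
        # These are the only parameters updated during prompt tuning.
        self.prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)


def prompt_tuning_step(model, soft_prompt, input_ids, labels, optimizer):
    """One update step: the backbone stays frozen, only the soft prompt moves."""
    for p in model.parameters():
        p.requires_grad_(False)

    # Look up token embeddings from the frozen model, then prepend the prompt.
    embeds = model.get_input_embeddings()(input_ids)
    inputs_embeds = soft_prompt(embeds)

    # Attention mask must also cover the prepended prompt positions.
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long,
                                device=inputs_embeds.device)

    out = model(inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

In-context learning, by contrast, would update no parameters at all (exemplars are simply concatenated into the prompt text), while full fine-tuning would leave `requires_grad` enabled on every backbone parameter.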