Title
Active Learning Over Multiple Domains in Natural Language Tasks
Authors
Abstract
Studies of active learning traditionally assume the target and source data stem from a single domain. However, in realistic applications, practitioners often require active learning with multiple sources of out-of-distribution data, where it is unclear a priori which data sources will help or hurt the target domain. We survey a wide variety of techniques in active learning (AL), domain shift detection (DS), and multi-domain sampling to examine this challenging setting for question answering and sentiment analysis. We ask (1) which families of methods are effective for this task, and (2) what properties of selected examples and domains achieve strong results? Among 18 acquisition functions from 4 families of methods, we find that H-Divergence methods, and particularly our proposed variant DAL-E, yield effective results, averaging 2-3% improvements over the random baseline. We also show the importance of a diverse allocation of domains, as well as the room for improvement of existing methods on both domain and example selection. Our findings yield the first comprehensive analysis of both existing and novel methods for practitioners faced with multi-domain active learning for natural language tasks.