Paper Title

Drinking from a Firehose: Continual Learning with Web-scale Natural Language

Paper Authors

Hexiang Hu, Ozan Sener, Fei Sha, Vladlen Koltun

Paper Abstract

Continual learning systems will interact with humans, with each other, and with the physical world through time -- and continue to learn and adapt as they do. An important open problem for continual learning is a large-scale benchmark that enables realistic evaluation of algorithms. In this paper, we study a natural setting for continual learning on a massive scale. We introduce the problem of personalized online language learning (POLL), which involves fitting personalized language models to a population of users that evolves over time. To facilitate research on POLL, we collect massive datasets of Twitter posts. These datasets, Firehose10M and Firehose100M, comprise 100 million tweets, posted by one million users over six years. Enabled by the Firehose datasets, we present a rigorous evaluation of continual learning algorithms on an unprecedented scale. Based on this analysis, we develop a simple algorithm for continual gradient descent (ConGraD) that outperforms prior continual learning methods on the Firehose datasets as well as earlier benchmarks. Collectively, the POLL problem setting, the Firehose datasets, and the ConGraD algorithm enable a complete benchmark for reproducible research on web-scale continual learning.
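
The abstract describes the POLL setting (fitting personalized language models to a stream of user posts that arrives over time) and the ConGraD algorithm, but details neither. As a rough, hypothetical illustration of the online per-user update loop that POLL implies, the sketch below fits a toy next-token model per user with plain SGD. All class names, dimensions, and the update rule are assumptions for illustration; this is not the paper's ConGraD method.

```python
# Hypothetical sketch of the POLL setting: a stream of (user, text) pairs
# arrives over time, and each user's personalized language model is updated
# online. The model, hyperparameters, and plain-SGD step are assumptions;
# the paper's ConGraD algorithm is not specified in the abstract.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy next-token language model (illustrative stand-in)."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

def online_update(model, optimizer, tokens):
    """One online gradient step on a single incoming post."""
    logits = model(tokens[:, :-1])  # predict each next token
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Per-user personalized models, fit to the stream as posts arrive.
models = {}  # user_id -> (model, optimizer)
stream = [("user_a", torch.randint(0, 1000, (1, 16))),
          ("user_b", torch.randint(0, 1000, (1, 16)))]  # stand-in for tweets

for user_id, tokens in stream:
    if user_id not in models:
        m = TinyLM()
        models[user_id] = (m, torch.optim.SGD(m.parameters(), lr=0.1))
    model, opt = models[user_id]
    online_update(model, opt, tokens)
```

A system at Firehose scale would presumably share parameters across the million users and replace the plain SGD step with the paper's continual gradient descent; the sketch only fixes the shape of the problem.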
