用众包标签从有限的教育数据中获得代表性学习

论文标题

用众包标签从有限的教育数据中获得代表性学习

Representation Learning from Limited Educational Data with Crowdsourced Labels

论文作者

Wang, Wentao, Xu, Guowei, Ding, Wenbiao, Huang, Gale Yan, Li, Guoliang, Tang, Jiliang, Liu, Zitao

论文摘要

事实证明，表示学习在许多任务（例如机器翻译，面部识别和建议）中的前所未有的成功中起着重要作用。大多数现有的表示学习方法通常需要大量一致且无噪声的标签。但是，由于预算限制和隐私问题等各种原因，在许多实际情况下，标签非常有限。在小标记的数据集上直接应用标准表示学习方法将很容易遇到过度拟合问题并导致次优的解决方案。更糟糕的是，在某些领域（例如教育）中，有限的标签通常由多种专业知识的工人注释，这在这种众包环境中产生了噪音和矛盾。在本文中，我们提出了一个新颖的框架，旨在从具有众包标签的有限数据中学习有效的表示。具体而言，我们设计了一个基于组的深神经网络，以从有限数量的培训样本中学习嵌入，并提供贝叶斯置信度估计器，以捕获众包标签之间的不一致。此外，为了加快培训过程，我们开发了一个艰难的示例选择程序，以适应模型错误分类的训练示例。在三个现实世界数据集上进行的广泛实验表明，与各种最先进的基线相比，我们从有限的数据中学习框架对学习表示的优越性。此外，我们还对我们提出的框架的每个主要组成部分提供了全面的分析，并介绍了它在我们的实际生产中取得的有希望的结果，以充分了解所提出的框架。

Representation learning has been proven to play an important role in the unprecedented success of machine learning models in numerous tasks, such as machine translation, face recognition and recommendation. The majority of existing representation learning approaches often require a large number of consistent and noise-free labels. However, due to various reasons such as budget constraints and privacy concerns, labels are very limited in many real-world scenarios. Directly applying standard representation learning approaches on small labeled data sets will easily run into over-fitting problems and lead to sub-optimal solutions. Even worse, in some domains such as education, the limited labels are usually annotated by multiple workers with diverse expertise, which yields noises and inconsistency in such crowdsourcing settings. In this paper, we propose a novel framework which aims to learn effective representations from limited data with crowdsourced labels. Specifically, we design a grouping based deep neural network to learn embeddings from a limited number of training samples and present a Bayesian confidence estimator to capture the inconsistency among crowdsourced labels. Furthermore, to expedite the training process, we develop a hard example selection procedure to adaptively pick up training examples that are misclassified by the model. Extensive experiments conducted on three real-world data sets demonstrate the superiority of our framework on learning representations from limited data with crowdsourced labels, comparing with various state-of-the-art baselines. In addition, we provide a comprehensive analysis on each of the main components of our proposed framework and also introduce the promising results it achieved in our real production to fully understand the proposed framework.

下载PDF全文

下载文献需遵守相关版权规定

论文标题