Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction

Paper title

Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction

Authors

Decorte, Jens-Joris, Van Hautte, Jeroen, Deleu, Johannes, Develder, Chris, Demeester, Thomas

Abstract

Skills play a central role in the job market and many human resources (HR) processes. In the wake of other digital experiences, today's online job market has candidates expecting to see the right opportunities based on their skill set. Similarly, enterprises increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured information about skills is often missing, and processes building on self- or manager-assessment have shown to struggle with issues around adoption, completeness, and freshness of the resulting data. Extracting skills is a highly challenging task, given the many thousands of possible skill labels mentioned either explicitly or merely described implicitly and the lack of finely annotated training corpora. Previous work on skill extraction overly simplifies the task to an explicit entity detection task or builds on manually annotated training data that would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction, based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements, and combining three different strategies in one model further increases the performance, up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for research purposes to stimulate further research on the task.
