Paper Title


fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP

Authors

Geng, Zhichao; Yan, Hang; Qiu, Xipeng; Huang, Xuanjing

Abstract


We present fastHan, an open-source toolkit for four basic tasks in Chinese natural language processing: Chinese word segmentation (CWS), Part-of-Speech (POS) tagging, named entity recognition (NER), and dependency parsing. The backbone of fastHan is a multi-task model based on a pruned BERT, which uses the first 8 layers of BERT. We also provide a 4-layer base model compressed from the 8-layer model. The joint model is trained and evaluated on 13 corpora covering the four tasks, yielding near state-of-the-art (SOTA) performance in dependency parsing and NER, and achieving SOTA performance in CWS and POS tagging. Moreover, fastHan transfers well, performing much better than popular segmentation tools on corpora outside its training data. To better meet the needs of practical applications, we allow users to further fine-tune fastHan with their own labeled data. In addition to its small size and excellent performance, fastHan is user-friendly: implemented as a Python package, it isolates users from the internal technical details and is convenient to use. The project is released on GitHub.
