Paper Title


fastHan: A BERT-based Multi-Task Toolkit for Chinese NLP

Authors

Geng, Zhichao; Yan, Hang; Qiu, Xipeng; Huang, Xuanjing

Abstract


We present fastHan, an open-source toolkit for four basic tasks in Chinese natural language processing: Chinese word segmentation (CWS), Part-of-Speech (POS) tagging, named entity recognition (NER), and dependency parsing. The backbone of fastHan is a multi-task model based on a pruned BERT, which uses the first 8 layers of BERT. We also provide a 4-layer base model compressed from the 8-layer model. The joint model is trained and evaluated on 13 corpora covering the four tasks, yielding near state-of-the-art (SOTA) performance in dependency parsing and NER, and achieving SOTA performance in CWS and POS tagging. Moreover, fastHan transfers well, performing much better than popular segmentation tools on corpora outside its training data. To better meet the needs of practical applications, we allow users to further fine-tune fastHan with their own labeled data. In addition to its small size and excellent performance, fastHan is user-friendly: implemented as a Python package, it isolates users from the internal technical details and is convenient to use. The project is released on GitHub.
