Detecting Potential Topics in News Using BERT, CRF and Wikipedia

Paper Title

Detecting Potential Topics In News Using BERT, CRF and Wikipedia

Author

Jadhav, Swapnil Ashok

Abstract

For a news content distribution platform like Dailyhunt, Named Entity Recognition is a pivotal task for building better user recommendation and notification algorithms. Apart from identifying names, locations and organisations in news across 13+ Indian languages and using them in algorithms, we also need to identify n-grams that do not necessarily fit the definition of a named entity, yet are important. Examples include "me too movement", "beef ban" and "alwar mob lynching". In this exercise, given an English-language text, we try to detect case-less n-grams that convey important information and can be used as topics and/or hashtags for a news article. The model is built using Wikipedia titles data, a private English news corpus, and the BERT-Multilingual pre-trained model with a Bi-GRU and CRF architecture. It shows promising results when compared with the industry-best Flair, Spacy and Stanford-caseless-NER in terms of F1 and especially Recall.
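The abstract does not say how the Wikipedia titles are turned into training labels for the sequence tagger. The sketch below is only an illustrative assumption, not the author's method: it tags lowercased tokens with B/I/O labels by greedy longest match against a small, hypothetical set of titles, which is one common way to produce weak supervision for a CRF-style tagger. The function name `bio_tag`, the `max_len` parameter and the sample sentence are all invented for this example.

```python
def bio_tag(tokens, titles, max_len=5):
    """Greedy longest-match BIO tagging of lowercased tokens against a title set.

    Hypothetical helper: `titles` stands in for a set of lowercased
    Wikipedia titles; the paper's actual labelling procedure is not given.
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate n-gram starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]) in titles:
                matched = n
                break
        if matched:
            tags[i] = "B"  # first token of the matched n-gram
            for j in range(i + 1, i + matched):
                tags[j] = "I"  # remaining tokens of the n-gram
            i += matched
        else:
            i += 1
    return tags


# Example phrases taken from the abstract; the sentence itself is made up.
titles = {"me too movement", "beef ban", "alwar mob lynching"}
tokens = "protests over the beef ban and the me too movement continued".split()
tags = bio_tag(tokens, titles)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Token and tag pairs of this kind could then serve as training input for a tagger such as the BERT, Bi-GRU and CRF stack named in the abstract, though the details of that pipeline are not stated here.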

