Paper Title

Detecting Potential Topics In News Using BERT, CRF and Wikipedia

Paper Author

Jadhav, Swapnil Ashok

Abstract

For a news content distribution platform like Dailyhunt, Named Entity Recognition is a pivotal task for building better user recommendation and notification algorithms. Apart from identifying names, locations, and organisations in news across 13+ Indian languages and using them in algorithms, we also need to identify n-grams that do not necessarily fit the definition of a named entity yet are important, for example "me too movement", "beef ban", and "alwar mob lynching". In this exercise, given an English-language text, we try to detect case-less n-grams that convey important information and can be used as topics and/or hashtags for a news article. The model is built using Wikipedia titles data, a private English news corpus, and a BERT-Multilingual pre-trained model with a Bi-GRU and CRF architecture. It shows promising results when compared with the industry-best Flair, spaCy, and Stanford-caseless-NER in terms of F1 and especially recall.
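
The abstract does not spell out how the Wikipedia titles are turned into training data. One plausible reading is distant supervision: lowercased n-grams in a news sentence that match a known title are marked as topic spans, and the resulting BIO tags serve as targets for a sequence labeller such as a CRF. The sketch below illustrates only that labelling idea. The function name `bio_tag`, the `TOPIC` label, the `max_n` limit, and the greedy longest-match rule are all assumptions made for illustration, not details taken from the paper.

```python
def bio_tag(tokens, titles, max_n=4):
    """Greedy longest-match BIO tagging against a set of lowercased titles.

    Hypothetical helper: the paper's actual labelling procedure is not
    described in the abstract.
    """
    lowered = [t.lower() for t in tokens]  # matching is case-less
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            if " ".join(lowered[i:i + n]) in titles:
                match_len = n
                break
        if match_len:
            tags[i] = "B-TOPIC"
            for j in range(i + 1, i + match_len):
                tags[j] = "I-TOPIC"
            i += match_len
        else:
            i += 1
    return tags


# Toy title set built from the examples quoted in the abstract.
titles = {"me too movement", "beef ban", "alwar mob lynching"}
tokens = "Protests over the beef ban followed the Me Too movement".split()
tags = bio_tag(tokens, titles)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In a pipeline like the one the abstract names, tag sequences of this kind would be the labels, while the BERT-Multilingual, Bi-GRU, and CRF layers would learn to predict them from the case-less text.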
