Paper Title


Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web

Authors

Silei Xu, Giovanni Campagna, Jian Li, Monica S. Lam

Abstract


Building a question-answering agent currently requires large annotated datasets, which are prohibitively expensive. This paper proposes Schema2QA, an open-source toolkit that can generate a Q&A system from a database schema augmented with a few annotations for each field. The key concept is to cover the space of possible compound queries on the database with a large number of in-domain questions synthesized with the help of a corpus of generic query templates. The synthesized data and a small paraphrase set are used to train a novel neural network based on the BERT pretrained model. We use Schema2QA to generate Q&A systems for five Schema.org domains (restaurants, people, movies, books, and music), and obtain an overall accuracy between 64% and 75% on crowdsourced questions for these domains. Once annotations and paraphrases are obtained for a Schema.org schema, no additional manual effort is needed to create a Q&A agent for any website that uses the same schema. Furthermore, we demonstrate that learning can be transferred from the restaurant to the hotel domain, obtaining a 64% accuracy on crowdsourced questions with no manual effort. Schema2QA achieves an accuracy of 60% on popular restaurant questions that can be answered using Schema.org. Its performance is comparable to Google Assistant, 7% lower than Siri, and 15% higher than Alexa. It outperforms all these assistants by at least 18% on more complex, long-tail questions.
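The synthesis idea in the abstract — instantiating a corpus of generic query templates with per-field annotations to cover compound queries — can be sketched as follows. This is a minimal illustration, not Schema2QA's actual API: the annotation dictionary, template strings, and `synthesize` helper are all hypothetical names invented for this example.

```python
# Hedged sketch of template-based question synthesis: each schema field
# carries a few natural-language annotation phrases, and generic query
# templates are instantiated with those phrases and example values to
# produce in-domain training questions. Names are illustrative only.
from itertools import product

# Hypothetical annotations for two fields of a Schema.org Restaurant
ANNOTATIONS = {
    "servesCuisine": ["cuisine", "food"],
    "priceRange": ["price", "price range"],
}

# Hypothetical generic query templates shared across domains
TEMPLATES = [
    "show me restaurants with {phrase} equal to {value}",
    "which restaurants have {phrase} {value} ?",
]

def synthesize(field: str, values: list[str]) -> list[str]:
    """Cross-product of templates, field annotation phrases, and values."""
    return [
        template.format(phrase=phrase, value=value)
        for template, phrase, value in product(
            TEMPLATES, ANNOTATIONS[field], values
        )
    ]

questions = synthesize("servesCuisine", ["italian"])
# 2 templates x 2 phrases x 1 value -> 4 synthesized questions
```

In the real toolkit such synthesized questions are combined with a small paraphrase set and used to train the BERT-based semantic parser; the cross-product structure is what lets a handful of annotations per field cover a large space of compound queries.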
