TabSim: A Siamese Neural Network for Accurate Estimation of Table Similarity

Paper title

TabSim: A Siamese Neural Network for Accurate Estimation of Table Similarity

Authors

Maryam Habibi, Johannes Starlinger, Ulf Leser

Abstract

Tables are a popular and efficient means of presenting structured information. They are used extensively in various kinds of documents, including web pages. Tables display information as a two-dimensional matrix, the semantics of which is conveyed by a mixture of structure (rows, columns), headers, caption, and content. Recent research has started to consider tables as first-class objects, not just as an addendum to texts, yielding interesting results for problems like table matching, table completion, or value imputation. All of these problems inherently rely on an accurate measure of the semantic similarity of two tables. We present TabSim, a novel method to compute table similarity scores using deep neural networks. Conceptually, TabSim represents a table as a learned concatenation of embeddings of its caption, its content, and its structure. Given two tables in this representation, a Siamese neural network is trained to compute a score correlating with the tables' semantic similarity. To train and evaluate our method, we created a gold standard corpus consisting of 1500 table pairs extracted from biomedical articles and manually scored regarding their degree of similarity, and adopted two other corpora originally developed for a different yet similar task. Our evaluation shows that TabSim outperforms other table similarity measures on average by approximately 7 percentage points (pp) in F1-score in a binary similarity classification setting and by approximately 1.5 pp in a ranking scenario.
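The Siamese idea described in the abstract can be illustrated with a minimal sketch. This is not the TabSim architecture: the abstract does not specify layers, embedding sizes, or the scoring function, so everything below is an assumption made for illustration. The names `make_encoder`, `table_vector`, and `siamese_score` are hypothetical, the encoder is a single untrained random linear layer with tanh, the three embeddings are joined by plain concatenation rather than a learned one, and cosine similarity stands in for the trained score. The only point it shows is that both tables pass through the same encoder with the same weights before they are compared.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed=0):
    """Build one shared encoder: a fixed random linear layer followed by tanh.

    Assumption: a real model would learn these weights; here they are random.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(vec):
        return [math.tanh(sum(w * x for w, x in zip(row, vec)))
                for row in weights]

    return encode


def table_vector(table):
    """Concatenate caption, content, and structure embeddings (plain lists)."""
    return table["caption"] + table["content"] + table["structure"]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def siamese_score(table_a, table_b, encode):
    """Encode both tables with the SAME encoder, then compare the outputs."""
    return cosine(encode(table_vector(table_a)),
                  encode(table_vector(table_b)))


# Toy 2-dimensional embeddings per component (made-up values).
table_a = {"caption": [0.2, 0.1], "content": [0.7, 0.3], "structure": [0.5, 0.5]}
table_b = {"caption": [0.1, 0.2], "content": [0.6, 0.4], "structure": [0.5, 0.4]}

encoder = make_encoder(in_dim=6, out_dim=4)
score = siamese_score(table_a, table_b, encoder)
print(score)
```

Because the two branches share one encoder, the score is symmetric in its arguments, and a table compared with itself scores 1. In a trained system the encoder weights and the scoring layer would be fitted to the manually scored table pairs.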
