Boilerplate Removal using a Neural Sequence Labeling Model

Paper title

Boilerplate Removal using a Neural Sequence Labeling Model

Authors

Jurek Leonhardt, Avishek Anand, Megha Khosla

Abstract

The extraction of main content from web pages is an important task for numerous applications, ranging from usability aspects, like reader views for news articles in web browsers, to information retrieval or natural language processing. Existing approaches are lacking as they rely on large amounts of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack in generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension which highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model.
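
The abstract states only that the model takes the HTML tags and words of a page as input. As a rough illustration of what such an input could look like, the sketch below flattens a page into a sequence of text nodes, each paired with the chain of tags that encloses it. This is not the authors' implementation: the names `LeafSequenceParser` and `page_to_sequence`, the skipped-tag list, and the tag-path representation are all assumptions made for this example, and the neural labeling step itself is not shown.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never visible content.
SKIPPED_TAGS = {"script", "style"}
# Void elements have no closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LeafSequenceParser(HTMLParser):
    """Flattens a page into a sequence of (tag path, words) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.sequence = []   # one entry per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        words = data.split()
        # Keep only non-empty text that is not inside a skipped tag.
        if words and not SKIPPED_TAGS.intersection(self.stack):
            self.sequence.append((tuple(self.stack), words))


def page_to_sequence(html):
    """Return the page as a list of (tag path, words) sequence elements."""
    parser = LeafSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.sequence
```

A sequence labeling model would then assign each element of this list a label such as content or boilerplate, using the neighbouring elements as context. How the tags and words are actually encoded is described in the paper, not here.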
