Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

Paper title

Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

Authors

Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, Cliff Brunk, Andrew Tomkins

Abstract

Large generative language models such as GPT-2 are well-known for their ability to generate text as well as their utility in supervised downstream tasks via fine-tuning. Our work is twofold: firstly we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Secondly, curious to understand the prevalence and nature of low quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.

Paper title