Paper Title
Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study
Paper Authors
Paper Abstract
Large generative language models such as GPT-2 are well known for their ability to generate text, as well as for their utility in supervised downstream tasks via fine-tuning. Our work is twofold. First, we demonstrate via human evaluation that classifiers trained to discriminate between human and machine-generated text emerge as unsupervised predictors of "page quality", able to detect low-quality content without any training. This enables fast bootstrapping of quality indicators in a low-resource setting. Second, curious to understand the prevalence and nature of low-quality pages in the wild, we conduct extensive qualitative and quantitative analysis over 500 million web articles, making this the largest-scale study ever conducted on the topic.
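The idea of reusing a human-vs-machine detector as a page quality signal can be illustrated with a minimal sketch. It is not the authors' pipeline: `toy_detector` is a hypothetical stand-in that only measures word repetition, and the names `score_pages` and `flag_low_quality`, the example URLs, and the 0.5 threshold are all assumptions made for illustration. In practice the detector would be a trained classifier that returns the probability that a text is machine-generated.

```python
from typing import Callable, Dict, List


def toy_detector(text: str) -> float:
    """Hypothetical stand-in for a trained human-vs-machine classifier.

    Returns a pseudo-probability that `text` is machine-generated, based
    only on the fraction of repeated words. A real detector would be a
    trained model; this heuristic exists only so the sketch runs.
    """
    words = text.lower().split()
    if not words:
        return 1.0  # assumption: treat empty pages as lowest quality
    return 1.0 - len(set(words)) / len(words)


def score_pages(pages: Dict[str, str],
                detector: Callable[[str], float]) -> Dict[str, float]:
    """Map each page URL to its detector score (higher = lower quality)."""
    return {url: detector(text) for url, text in pages.items()}


def flag_low_quality(scores: Dict[str, float],
                     threshold: float = 0.5) -> List[str]:
    """Return the URLs whose score meets or exceeds the threshold, sorted."""
    return sorted(url for url, score in scores.items() if score >= threshold)


# Two made-up pages: one repetitive, one ordinary prose.
pages = {
    "a.example": "buy cheap pills buy cheap pills buy cheap pills",
    "b.example": "The committee published its annual report on water quality",
}
scores = score_pages(pages, toy_detector)
flagged = flag_low_quality(scores)
print(flagged)
```

No quality labels are used anywhere in this sketch, which is the sense in which the predictor is unsupervised: the only supervision a real detector would need is the human-vs-machine distinction it was originally trained on.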