EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English

Paper Title

EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English

Authors

Weicheng Ma, Samiha Datta, Lili Wang, Soroush Vosoughi

Abstract

While cultural backgrounds have been shown to affect linguistic expressions, existing natural language processing (NLP) research on culture modeling is overly coarse-grained and does not examine cultural differences among speakers of the same language. To address this problem and augment NLP models with cultural background features, we collect, annotate, manually validate, and benchmark EnCBP, a finer-grained news-based cultural background prediction dataset in English. Through language modeling (LM) evaluations and manual analyses, we confirm that there are noticeable differences in linguistic expressions among five English-speaking countries and across four states in the US. Additionally, our evaluations on nine syntactic (CoNLL-2003), semantic (PAWS-Wiki, QNLI, STS-B, and RTE), and psycholinguistic tasks (SST-5, SST-2, Emotion, and Go-Emotions) show that, while introducing cultural background information does not benefit the Go-Emotions task due to text domain conflicts, it noticeably improves deep learning (DL) model performance on other tasks. Our findings strongly support the importance of cultural background modeling to a wide variety of NLP tasks and demonstrate the applicability of EnCBP in culture-related research.
