Paper Title
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources
Paper Authors
Paper Abstract
In recent years, large-scale data collection efforts have prioritized the amount of data collected in order to improve the modeling capabilities of large language models. This prioritization, however, has resulted in concerns with respect to the rights of data subjects represented in data collections, particularly when considering the difficulty in interrogating these collections due to insufficient documentation and tools for analysis. Mindful of these pitfalls, we present our methodology for a documentation-first, human-centered data collection project as part of the BigScience initiative. We identified a geographically diverse set of target language groups (Arabic, Basque, Chinese, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. To structure this effort, we developed our online catalogue as a supporting tool for gathering metadata through organized public hackathons. We present our development process; analyses of the resulting resource metadata, including distributions over languages, regions, and resource types; and our lessons learned in this endeavor.