Paper title
Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter
Paper authors
Paper abstract
In real-time, social media data strongly imprints world events, popular culture, and day-to-day conversations by millions of ordinary people at a scale that is scarcely conventionalized and recorded. Vitally, and absent from many standard corpora such as books and news archives, sharing and commenting mechanisms are native to social media platforms, enabling us to quantify social amplification (i.e., popularity) of trending storylines and contemporary cultural phenomena. Here, we describe Storywrangler, a natural language processing instrument designed to carry out an ongoing, day-scale curation of over 100 billion tweets containing roughly 1 trillion 1-grams from 2008 to 2021. For each day, we break tweets into unigrams, bigrams, and trigrams spanning over 100 languages. We track n-gram usage frequencies, and generate Zipf distributions, for words, hashtags, handles, numerals, symbols, and emojis. We make the data set available through an interactive time series viewer, and as downloadable time series and daily distributions. Although Storywrangler leverages Twitter data, our method of extracting and tracking dynamic changes of n-grams can be extended to any similar social media platform. We showcase a few examples of the many possible avenues of study we aim to enable, including how social amplification can be visualized through 'contagiograms'. We also present some example case studies that bridge n-gram time series with disparate data sources to explore sociotechnical dynamics of famous individuals, box office success, and social unrest.
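The per-day processing the abstract describes (splitting tweets into n-grams, counting usage frequencies, and ranking them into a Zipf-style distribution) can be illustrated with a minimal Python sketch. This is not Storywrangler's actual implementation: the tokenizer regex and the function names `tokenize`, `ngrams`, and `daily_zipf` are hypothetical, and language detection and retweet handling are left out.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the paper's parser): keeps words,
# numerals, #hashtags and @handles whole, and emits any other non-space
# character (symbols, single-codepoint emojis) as its own token.
TOKEN_RE = re.compile(r"[#@]?\w+|[^\w\s]")


def tokenize(text):
    """Lowercase a tweet and split it into tokens."""
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def daily_zipf(tweets, n):
    """Build a rank-frequency table for one day's tweets.

    Returns a list of (rank, ngram, count, relative_frequency) tuples,
    sorted by descending count. Ties are broken alphabetically and given
    consecutive ordinal ranks, which is a simplification.
    """
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tokenize(tweet), n))
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, gram, count, count / total)
        for rank, (gram, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    day = ["I love pizza", "I love #pizza", "pizza time"]
    for n in (1, 2, 3):
        print(n, daily_zipf(day, n)[:3])
```

Running `daily_zipf` once per calendar day and recording the rank or relative frequency of a chosen n-gram across days would yield the kind of time series the abstract refers to.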