On Spatial Lag Models estimated using crowdsourcing, web-scraping or other unconventionally collected data

Paper title

On Spatial Lag Models estimated using crowdsourcing, web-scraping or other unconventionally collected data

Authors

Giuseppe Arbia, Vincenzo Nardelli

Abstract

The Big Data revolution is challenging state-of-the-art statistical and econometric techniques, not only because of the computational burden connected with the high volume and speed with which data are generated, but even more because of the variety of sources through which data are collected (Arbia, 2021). This paper concentrates specifically on this last aspect. Common examples of non-traditional Big Data sources are crowdsourcing (data voluntarily collected by individuals) and web scraping (data extracted from websites and reshaped into a structured dataset). A common characteristic of these unconventional data collections is the lack of any precise statistical sample design, a situation described in statistics as 'convenience sampling'. As is well known, no probabilistic inference is possible under these conditions. To overcome this problem, Arbia et al. (2018) proposed the use of a special form of post-stratification (termed 'post-sampling'), with which data are manipulated prior to their use in an inferential context. In this paper we generalize this approach, using the same idea to estimate a Spatial Lag Model (SLM). We start by showing through a Monte Carlo study that, when data are collected without a proper design, parameter estimates can be biased. Secondly, we propose a post-sampling strategy to tackle this problem. We show that the proposed strategy does achieve a bias reduction, but at the price of a concomitant increase in the variance of the estimators. We thus suggest an MSE-correction operational strategy. The paper also contains a formal derivation of the increase in variance implied by the post-sampling procedure and concludes with an empirical application of the method to the estimation of a hedonic price model in the city of Milan using web-scraped data.
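
For reference, the spatial lag model named in the abstract is conventionally written as in the first line below. The abstract gives no formulas, so the notation is the standard textbook one and is an assumption here: y is the vector of observations, W a spatial weights matrix, ρ the spatial autoregressive parameter, X the regressors, β their coefficients and ε the error term. The second line is the usual decomposition of mean squared error, which makes explicit why a reduction in bias can be offset by an increase in variance.

```latex
% Spatial lag model in its standard form (notation assumed, not taken from the paper)
y = \rho W y + X \beta + \varepsilon, \qquad \mathrm{E}[\varepsilon] = 0, \quad \mathrm{Var}(\varepsilon) = \sigma^2 I_n

% Mean squared error of an estimator \hat{\theta} of a parameter \theta
\operatorname{MSE}(\hat{\theta}) = \bigl[\operatorname{Bias}(\hat{\theta})\bigr]^2 + \operatorname{Var}(\hat{\theta})
```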
