Paper Title

BAD to the Bone: Big Active Data at its Core

Authors

Steven Jacobs, Xikui Wang, Michael J. Carey, Vassilis J. Tsotras, Md Yusuf Sarwar Uddin

Abstract

Virtually all of today's Big Data systems are passive in nature, responding to queries posted by their users. Instead, we are working to shift Big Data platforms from passive to active. In our view, a Big Active Data (BAD) system should continuously and reliably capture Big Data while enabling timely and automatic delivery of relevant information to a large pool of interested users, as well as supporting retrospective analyses of historical information. While various scalable streaming query engines have been created, their active behavior is limited to a (relatively) small window of the incoming data. To this end, we have created a BAD platform that combines ideas and capabilities from both Big Data and Active Data (e.g., Publish/Subscribe, Streaming Engines). It supports complex subscriptions that consider not only newly arrived items but also their relationships to past, stored data. Further, it can provide actionable notifications by enriching the subscription results with other useful data. Our platform extends an existing open-source Big Data Management System, Apache AsterixDB, with an active toolkit. The toolkit contains features to rapidly ingest semistructured data, share execution pipelines among users, manage scaled user data subscriptions, and actively monitor the state of the data to produce individualized information for each user. This paper describes the features and design of our current BAD data platform and demonstrates its ability to scale without sacrificing query capabilities or result individualization.
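The three subscription ideas named in the abstract (conditions that relate new items to stored data, enrichment of results, and evaluation shared among users) can be illustrated with a small in-memory toy. This is only a sketch under stated assumptions: the class `ToyActivePlatform`, its methods, and the emergency-report and shelter data are all hypothetical names chosen for illustration, and none of it reflects the actual AsterixDB API or the authors' implementation.

```python
from collections import defaultdict


class ToyActivePlatform:
    """In-memory toy; every name here is hypothetical, not AsterixDB's API."""

    def __init__(self):
        self.history = defaultdict(list)     # region -> severities of past reports
        self.shelters = defaultdict(list)    # region -> stored enrichment data
        self.subscribers = defaultdict(set)  # region -> users sharing one evaluation

    def subscribe(self, user, region):
        # Users with the same parameter (region) share a single evaluation.
        self.subscribers[region].add(user)

    def ingest(self, region, severity):
        """Store a report and return notifications for affected subscribers."""
        past = self.history[region]
        # The condition relates the new item to all stored data, not a window:
        # notify only when severity exceeds every report seen for this region.
        is_new_peak = not past or severity > max(past)
        past.append(severity)
        if not is_new_peak:
            return []
        # Evaluate and enrich once per region, then fan out to each subscriber.
        payload = {
            "region": region,
            "severity": severity,
            "shelters": list(self.shelters[region]),
        }
        return [{"user": u, **payload} for u in sorted(self.subscribers[region])]


platform = ToyActivePlatform()
platform.shelters["riverside"].append("City Hall")
platform.subscribe("alice", "riverside")
platform.subscribe("bob", "riverside")

first = platform.ingest("riverside", 3)   # new peak: both users are notified
second = platform.ingest("riverside", 2)  # below the stored peak: no one is
```

Grouping subscribers by parameter means the condition and the enrichment lookup run once per region rather than once per user, which is one plausible reading of "share execution pipelines among users"; the real platform may organize this quite differently.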
