Paper Title

High-Performance Mining of COVID-19 Open Research Datasets for Text Classification and Insights in Cloud Computing Environments

Authors

Jie Zhao, Maria A. Rodriguez, Rajkumar Buyya

Abstract

The COVID-19 global pandemic is an unprecedented health crisis. Since the outbreak, researchers around the world have produced an extensive body of literature. For the research community and the general public to digest it, it is crucial to analyse the text and provide insights in a timely manner, which requires a considerable amount of computational power. Cloud computing has been widely adopted in academia and industry in recent years. In particular, hybrid cloud is gaining popularity because of its two-fold benefits: utilising existing resources to save cost, and using additional cloud service providers to gain access to extra computing resources on demand. In this paper, we developed a system utilising the Aneka PaaS middleware with parallel processing and multi-cloud capability to accelerate the ETL and article categorising process using machine learning technology on a hybrid cloud. The results are then persisted for further referencing, searching and visualising. Our performance evaluation shows that the system can help reduce processing time and achieve linear scalability. Beyond COVID-19, the application might be used directly in broader scholarly article indexing and analysis.
