Paper Title
CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis
Paper Authors
Paper Abstract
Large-scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, because descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task: labeling computational notebook cells as stages in the data analysis process, from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest ever study of scientific code and find that notebook composition correlates with the citation count of corresponding papers.
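To make the notion of heuristic weak supervision over notebook cells concrete, the following is a minimal sketch, not the method or heuristics used in the paper. It parses a cell with Python's standard `ast` module, collects the names of called functions, and votes for a stage. The seed sets in `STAGE_SEEDS`, the stage keys, and the helper names `called_names` and `weak_label` are all assumptions introduced for illustration.

```python
import ast

# Hypothetical seed functions per stage. These are illustrative assumptions,
# not the expert-supplied heuristics evaluated in the paper.
STAGE_SEEDS = {
    "import": {"read_csv", "read_excel", "load"},
    "wrangle": {"dropna", "fillna", "merge", "groupby"},
    "explore": {"plot", "hist", "describe", "scatter"},
    "model": {"fit", "LinearRegression", "RandomForestClassifier"},
    "evaluate": {"score", "accuracy_score", "confusion_matrix"},
}


def called_names(source):
    """Return the names of all functions and methods called in a code cell."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):   # e.g. pd.read_csv(...)
                names.append(func.attr)
            elif isinstance(func, ast.Name):      # e.g. accuracy_score(...)
                names.append(func.id)
    return names


def weak_label(source):
    """Return the stage with the most seed matches, or None if nothing matches.

    Cells that are not valid Python (for example IPython magics) get None.
    Ties are broken by the order of keys in STAGE_SEEDS.
    """
    try:
        names = called_names(source)
    except SyntaxError:
        return None
    votes = {
        stage: sum(name in seeds for name in names)
        for stage, seeds in STAGE_SEEDS.items()
    }
    best = max(votes, key=votes.get)
    return best if votes[best] > 0 else None
```

Labels produced this way are noisy and leave many cells unlabeled, which is why they would serve only as a weak training signal for a learned model rather than as a classifier in their own right.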