Demystifying a Dark Art: Understanding Real-World Machine Learning Model Development

Paper Title

Demystifying a Dark Art: Understanding Real-World Machine Learning Model Development

Authors

Lee, Angela; Xin, Doris; Lee, Doris; Parameswaran, Aditya

Abstract

It is well known that developing machine learning (ML) workflows is a dark art: even experts struggle to find an optimal workflow that leads to a high-accuracy model. Users currently rely on empirical trial and error to build their own set of battle-tested guidelines to inform their modeling decisions. In this study, we aim to demystify this dark art by understanding how people iterate on ML workflows in practice. We analyze over 475k user-generated workflows on OpenML, an open-source platform for tracking and sharing ML workflows. We find that users often adopt a manual, automated, or mixed approach when iterating on their workflows. We observe that manual approaches result in fewer wasted iterations than automated approaches. Yet automated approaches often explore more preprocessing and hyperparameter options, resulting in higher performance overall. This suggests potential benefits for a human-in-the-loop ML system that appropriately recommends a clever combination of the two strategies.
