Paper Title
Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems
Paper Authors
Paper Abstract
Explainable artificially intelligent (XAI) systems form part of sociotechnical systems, e.g., human+AI teams tasked with making decisions. Yet, current XAI systems are rarely evaluated by measuring the performance of human+AI teams on actual decision-making tasks. We conducted two online experiments and one in-person think-aloud study to evaluate two currently common techniques for evaluating XAI systems: (1) using proxy, artificial tasks such as how well humans predict the AI's decision from the given explanations, and (2) using subjective measures of trust and preference as predictors of actual performance. The results of our experiments demonstrate that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks. Further, the subjective measures on evaluations with actual decision-making tasks did not predict the objective performance on those same tasks. Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.