Paper Title
Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?
Paper Authors
Paper Abstract
Offline evaluation is a popular approach to determine the best algorithm in terms of the chosen quality metric. However, if the chosen metric calculates something unexpected, this miscommunication can lead to poor decisions and wrong conclusions. In this paper, we thoroughly investigate quality metrics used for recommender systems evaluation. We look at the practical aspect of implementations found in modern RecSys libraries and at the theoretical aspect of definitions in academic papers. We find that Precision is the only metric universally understood among papers and libraries, while other metrics may have different interpretations. Metrics implemented in different libraries sometimes have the same name but measure different things, which leads to different results given the same input. When defining metrics in an academic paper, authors sometimes omit explicit formulations or give references that do not contain explanations either. In 47% of cases, we cannot easily know how the metric is defined because the definition is not clear or absent. These findings highlight yet another difficulty in recommender system evaluation and call for a more detailed description of evaluation protocols.
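The abstract's claim that same-named metrics can disagree on identical input is easy to reproduce. The sketch below is illustrative and not taken from the paper or from any specific RecSys library: it shows two plausible definitions of Recall@k that differ only in the normalization term, a discrepancy of exactly the kind the authors describe.

```python
# Illustrative sketch (not from the paper): two Recall@k variants that
# share a name but normalize differently, so they disagree on the same input.

def recall_at_k_full(recommended, relevant, k):
    # Variant A: divide by the total number of relevant items.
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)

def recall_at_k_capped(recommended, relevant, k):
    # Variant B: divide by min(k, |relevant|), so a perfect top-k list
    # scores 1.0 even when there are more relevant items than k slots.
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / min(k, len(relevant))

recommended = ["a", "b", "c"]    # model's ranked top-3 list
relevant = ["a", "b", "d", "e"]  # ground-truth relevant items

print(recall_at_k_full(recommended, relevant, k=3))    # 2/4 = 0.5
print(recall_at_k_capped(recommended, relevant, k=3))  # 2/3 ≈ 0.667
```

Both functions see the same two hits ("a" and "b"), yet report 0.5 and 0.667 respectively; averaged over a test set, such variants can rank algorithms differently, which is why the paper calls for explicit metric definitions in evaluation protocols.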