Title
Bayesian Hierarchical Bernoulli-Weibull Mixture Model for Extremely Rare Events
Authors
Abstract
Estimating the duration of user behavior is a central concern for most internet companies. Survival analysis is a promising method for analyzing the expected time until an event; it usually assumes that all subjects share the same survival function and that the event will eventually occur. However, these assumptions are inappropriate when users behave differently or when the event never occurs for some users, e.g., the conversion period on a web service for light users who have no intention of using the service actively. In particular, if the proportion of inactive users is high, these assumptions can lead to undesirable results. To address these challenges, this paper proposes a mixture model that treats active and inactive individuals separately by means of a latent variable. First, we define this specific problem setting and show the limitations of conventional survival analysis in addressing it. We demonstrate how naturally our Bernoulli-Weibull model accommodates the challenge. The proposed model is further extended to a Bayesian hierarchical model to incorporate subject-specific parameters, offering substantial improvements over conventional, non-hierarchical models in terms of WAIC and WBIC. Second, an experiment and extensive analysis were conducted using real-world data from the Japanese job search website CareerTrek, offered by BizReach, Inc. The analysis raises several research questions, such as how the activation rate and conversion rate differ between user categories, and how the instantaneous rate of event occurrence changes over time. We provide quantitative answers and interpretations for these questions. Furthermore, the model is inferred in a Bayesian manner, which enables us to represent uncertainty with credible intervals for the parameters and predictive quantities.
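To make the idea of a Bernoulli-Weibull mixture concrete, the following is a minimal sketch of a log-likelihood for right-censored durations. It is an illustration under stated assumptions, not the paper's exact formulation: the abstract does not give the parameterization, so the activation probability `q`, the Weibull `shape` and `scale`, and the function names are all hypothetical. The sketch assumes a user is active with probability `q`, that active users have a Weibull-distributed time to event, and that inactive users never experience the event.

```python
import math


def weibull_log_pdf(t, shape, scale):
    """Log density of a Weibull(shape, scale) distribution at t > 0."""
    z = t / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape


def weibull_log_survival(t, shape, scale):
    """Log survival function log S(t) = -(t / scale) ** shape."""
    return -((t / scale) ** shape)


def mixture_log_likelihood(durations, observed, q, shape, scale):
    """Log-likelihood of an assumed Bernoulli-Weibull mixture.

    durations: observed time for each subject (event time or censoring time)
    observed:  True if the event occurred, False if the subject was censored
    q:         assumed probability that a subject is active (0 < q <= 1)
    """
    total = 0.0
    for t, event in zip(durations, observed):
        if event:
            # The event occurred, so the subject must be active.
            total += math.log(q) + weibull_log_pdf(t, shape, scale)
        else:
            # Censored: either inactive, or active but the event has not happened yet.
            survival = math.exp(weibull_log_survival(t, shape, scale))
            total += math.log((1.0 - q) + q * survival)
    return total


# Hypothetical toy data: three conversions and two censored users.
toy_durations = [2.0, 5.0, 1.5, 30.0, 30.0]
toy_observed = [True, True, True, False, False]
print(mixture_log_likelihood(toy_durations, toy_observed, q=0.6, shape=1.2, scale=4.0))
```

Under this assumed form, a censored subject observed for a very long time contributes roughly `log(1 - q)`, which is how the mixture accounts for users for whom the event never occurs. A hierarchical version would place priors on subject-level or category-level parameters; that step is not shown here.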