Paper Title

Topic Analysis for Text with Side Data

Paper Authors

Biyi Fang, Kripa Rajshekhar, Diego Klabjan

Paper Abstract

Although latent factor models (e.g., matrix factorization) obtain good performance in predictions, they suffer from several problems including cold-start, non-transparency, and suboptimal recommendations. In this paper, we employ text with side data to tackle these limitations. We introduce a hybrid generative probabilistic model that combines a neural network with a latent topic model, which is a four-level hierarchical Bayesian model. In the model, each document is modeled as a finite mixture over an underlying set of topics and each topic is modeled as an infinite mixture over an underlying set of topic probabilities. Furthermore, each topic probability is modeled as a finite mixture over side data. In the context of text, the neural network provides an overview distribution about side data for the corresponding text, which is the prior distribution in LDA to help perform topic grouping. The approach is evaluated on several different datasets, where the model is shown to outperform standard LDA and Dirichlet-multinomial regression (DMR) in terms of topic grouping, model perplexity, classification and comment generation.

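As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below shows an LDA-style generative process in which a small neural network maps a document's side data to the Dirichlet prior over that document's topic proportions. The network `side_data_net`, its two-layer structure, and all dimensions are hypothetical choices made only for this example.

```python
import numpy as np

# Minimal sketch (an assumption, not the paper's actual code) of an LDA-style
# generative process where a neural network turns side data into the
# Dirichlet prior over a document's topic proportions.

rng = np.random.default_rng(0)

K = 5        # number of topics
V = 1000     # vocabulary size
D_SIDE = 8   # dimensionality of the side data (hypothetical)

# Hypothetical two-layer network: side data -> positive Dirichlet parameters.
W1 = rng.normal(scale=0.1, size=(D_SIDE, 32))
W2 = rng.normal(scale=0.1, size=(32, K))

def side_data_net(x):
    """Map a side-data vector to Dirichlet concentration parameters (alpha > 0)."""
    h = np.tanh(x @ W1)
    return np.exp(h @ W2)           # exponentiation keeps alpha strictly positive

# Topic-word distributions; in the full model these would be learned.
phi = rng.dirichlet(np.full(V, 0.1), size=K)

def generate_document(x_side, n_words=50):
    """Generate one synthetic document conditioned on its side data."""
    alpha = side_data_net(x_side)   # network output plays the role of the LDA prior
    theta = rng.dirichlet(alpha)    # document-topic proportions
    doc = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)  # pick a topic for this word
        w = rng.choice(V, p=phi[z]) # pick a word from that topic's distribution
        doc.append(int(w))
    return doc

print(generate_document(rng.normal(size=D_SIDE))[:10])
```

The sketch only covers the forward generative direction; learning the topic-word distributions and the network weights, and the four-level hierarchical structure compared against DMR, are the subject of the paper itself.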