Paper Title
Can x2vec Save Lives? Integrating Graph and Language Embeddings for Automatic Mental Health Classification
Authors
Abstract
Graph and language embedding models are becoming commonplace in large-scale analyses because they can represent complex, sparse data densely in a low-dimensional space. Integrating the complementary relational and communicative data these models capture may be especially helpful when predicting rare events or classifying members of hidden populations, tasks that require huge and sparse datasets for generalizable analyses. For example, due to social stigma and comorbidities, mental health support groups often form in amorphous online groups. Predicting suicidality among individuals in these settings using standard network analyses is prohibitive due to resource limits (e.g., memory), and adding auxiliary data such as text to these models exacerbates complexity- and sparsity-related issues. Here, I show how merging graph and language embedding models (metapath2vec and doc2vec) avoids these limits and extracts unsupervised clustering data without domain expertise or feature engineering. Graph and language distances to a suicide support group have little correlation (ρ < 0.23), implying the two models are not embedding redundant information. When used separately to predict suicidality among individuals, graph and language data generate relatively accurate results (69% and 76%, respectively); when integrated, however, the two data sources produce highly accurate predictions (90%, with 10% false positives and 12% false negatives). Visualizing graph embeddings annotated with predictions of potentially suicidal individuals shows that the integrated model could classify such individuals even if they are positioned far from the support group. These results extend research on the importance of simultaneously analyzing behavior and language in massive networks, and they support efforts to integrate embedding models for different kinds of data when predicting and classifying, particularly when rare events are involved.
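The redundancy check and the integration step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy vectors stand in for metapath2vec and doc2vec outputs, cosine distance to the centroid of known support-group members stands in for "distance to the support group", Spearman rank correlation stands in for ρ, and simple concatenation stands in for the merging step. The abstract does not specify any of these choices, and the names `graph_emb`, `doc_emb`, and `support_members` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def centroid(vectors):
    """Return the component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def integrate(graph_vec, doc_vec):
    """Assumed merging step: concatenate the two embeddings per user."""
    return list(graph_vec) + list(doc_vec)

# Hypothetical toy embeddings; real ones would come from trained models.
graph_emb = {
    "u1": [0.9, 0.1], "u2": [0.8, 0.3],
    "u3": [0.1, 0.9], "u4": [0.2, 0.7], "u5": [0.5, 0.5],
}
doc_emb = {
    "u1": [0.2, 0.9, 0.1], "u2": [0.7, 0.2, 0.4],
    "u3": [0.3, 0.8, 0.2], "u4": [0.9, 0.1, 0.3], "u5": [0.4, 0.4, 0.8],
}
support_members = ["u1", "u2"]  # users known to belong to the support group

users = sorted(graph_emb)
graph_center = centroid([graph_emb[u] for u in support_members])
doc_center = centroid([doc_emb[u] for u in support_members])

# Distance of every user to the support group in each embedding space.
graph_dist = [cosine_distance(graph_emb[u], graph_center) for u in users]
doc_dist = [cosine_distance(doc_emb[u], doc_center) for u in users]

# A low |rho| suggests the two spaces carry non-redundant information.
rho = spearman(graph_dist, doc_dist)

# Integrated feature vectors that a downstream classifier could consume.
features = {u: integrate(graph_emb[u], doc_emb[u]) for u in users}
```

Concatenation is only one plausible way to combine the two spaces. A weak rank correlation between `graph_dist` and `doc_dist` is what would motivate feeding both sets of features to a single classifier rather than relying on either one alone.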