Paper Title
Evaluating Sparse Interpretable Word Embeddings for Biomedical Domain
Authors
Abstract
Word embeddings have found their way into a wide range of natural language processing tasks, including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, as well as hidden patterns and trends in the data, they fail to offer interpretability. Interpretability is a key means of justification, which is integral to biomedical applications. We present a comprehensive study of the interpretability of word embeddings in the medical domain, focusing on the role of sparse methods. We provide qualitative and quantitative measurements and metrics for the interpretability of word vector representations. For the quantitative evaluation, we introduce an extensive categorized dataset that can be used to quantify interpretability based on category theory. Intrinsic and extrinsic evaluations of the studied methods are also presented. As for the latter, we propose datasets that can be used for effective extrinsic evaluation of word vectors in the biomedical domain. Our experiments show that sparse word vectors are far more interpretable while preserving the performance of their original vectors in downstream tasks.