Paper Title


Hierarchical GPT with Congruent Transformers for Multi-Sentence Language Models

Paper Authors

Jihyeon Roh, Huiseong Gim, Soo-Young Lee

Paper Abstract


We report a GPT-based multi-sentence language model for dialogue generation and document understanding. First, we propose a hierarchical GPT which consists of three blocks: a sentence encoding block, a sentence generating block, and a sentence decoding block. The sentence encoding and decoding blocks are essentially the encoder blocks of the standard Transformer, and they work on each sentence independently. The sentence generating block is inserted between the encoding and decoding blocks, and generates the next sentence embedding vector from the previous sentence embedding vectors. We believe this is the way humans make conversation and understand paragraphs and documents. Since each sentence may consist of fewer words, the sentence encoding and decoding Transformers can use embedding vectors of much smaller dimension. Second, we note that the attention in Transformers utilizes the inner-product similarity measure. Therefore, to compare two vectors in the same space, we set the transform matrices for queries and keys to be the same; otherwise, the similarity concept is incongruent. We report experimental results showing that these two modifications increase language model performance on tasks with multiple sentences.
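To make the three-block layout concrete, here is a minimal PyTorch sketch of the hierarchy the abstract describes. The layer counts, widths, mean-pooling of word states into a sentence vector, positional embeddings, and the per-word decoder head are all illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class HierarchicalGPT(nn.Module):
    """Sketch of the three-block hierarchy: a per-sentence encoder, a
    GPT-style generator over sentence embeddings, and a per-sentence
    decoder.  All sizes and pooling choices are assumptions."""

    def __init__(self, vocab_size=10000, d_word=128, d_sent=512,
                 max_words=32, max_sents=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_word)
        self.word_pos = nn.Parameter(0.02 * torch.randn(1, max_words, d_word))
        self.sent_pos = nn.Parameter(0.02 * torch.randn(1, max_sents, d_sent))
        # Sentence encoding block: standard Transformer encoder layers run on
        # each sentence independently; short sentences allow a small d_word.
        self.sent_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_word, nhead=4, batch_first=True), 2)
        self.to_sent = nn.Linear(d_word, d_sent)
        # Sentence generating block: causal (GPT-style) attention over the
        # sequence of sentence embedding vectors.
        self.sent_generator = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_sent, nhead=8, batch_first=True), 2)
        # Sentence decoding block: expands each predicted sentence embedding
        # back into per-word logits.
        self.to_word = nn.Linear(d_sent, d_word)
        self.sent_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_word, nhead=4, batch_first=True), 2)
        self.lm_head = nn.Linear(d_word, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, s, w = tokens.shape                      # (batch, sentences, words)
        x = self.embed(tokens).view(b * s, w, -1) + self.word_pos[:, :w]
        x = self.sent_encoder(x)                    # encode each sentence alone
        sent = self.to_sent(x.mean(dim=1)).view(b, s, -1) + self.sent_pos[:, :s]
        mask = nn.Transformer.generate_square_subsequent_mask(s).to(tokens.device)
        nxt = self.sent_generator(sent, mask=mask)  # next-sentence embeddings
        y = self.to_word(nxt).reshape(b * s, 1, -1).expand(-1, w, -1)
        y = self.sent_decoder(y + self.word_pos[:, :w])
        return self.lm_head(y).view(b, s, w, -1)    # word logits per sentence
```

Training such a sketch would presumably apply a cross-entropy loss between the logits produced for sentence t and the words of sentence t+1, mirroring ordinary GPT next-token training at the sentence level.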
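The congruent-attention modification can be stated just as compactly: tie the query and key projections to a single matrix, so the inner product compares vectors in one space. The single-head sketch below uses assumed dimensions; a multi-head variant would share each head's query and key projections in the same way.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CongruentSelfAttention(nn.Module):
    """Single-head self-attention with one shared projection for queries
    and keys, so the inner-product similarity compares vectors that live
    in the same space.  Width choices are assumptions for the sketch."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qk = nn.Linear(d_model, d_model, bias=False)  # shared W_Q = W_K
        self.v = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.qk(x)                    # queries ...
        k = self.qk(x)                    # ... and keys use the same matrix
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ self.v(x)
```

A side effect of tying the two projections is that each attention layer carries one fewer weight matrix, and the resulting similarity between two token representations is symmetric.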
