Paper Title

GLM-130B: An Open Bilingual Pre-trained Model

Paper Authors

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Peng Zhang, Yuxiao Dong, Jie Tang

Paper Abstract

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B -- the largest Chinese language model -- across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4$\times$RTX 3090 (24G) or 8$\times$RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at \url{https://github.com/THUDM/GLM-130B/}.
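
As a rough illustration of why weight-only INT4 quantization makes the quoted hardware configurations feasible, the sketch below estimates the weight memory footprint at different precisions. This is a simplified back-of-the-envelope estimate (not from the paper) based only on the numbers in the abstract, i.e. the 130B parameter count and the per-GPU memory sizes; activations, KV cache, and runtime overheads are ignored.

```python
# Back-of-the-envelope check of the INT4 inference claim in the abstract.
# Assumption: weight storage dominates; activation/KV-cache/framework
# overheads are ignored for this estimate.

PARAMS = 130e9  # GLM-130B parameter count quoted in the abstract


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


for precision, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{precision:>5}: ~{weight_memory_gb(PARAMS, bits):.0f} GB of weights")

# INT4 gives roughly 65 GB of weights, consistent with the configurations
# quoted in the abstract:
#   4 x RTX 3090  (24 GB each) = 96 GB total
#   8 x RTX 2080 Ti (11 GB each) = 88 GB total
```

Under this estimate, FP16 weights (~260 GB) would not fit on either configuration, while INT4 (~65 GB) leaves headroom on both, which is the point the abstract makes about affordability.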
