The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

https://huggingface.co/papers/2402.17764

https://arxiv.org/abs/2402.17764



Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

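The ternary scheme can be illustrated with the absmean quantization described in the BitNet b1.58 paper: scale each weight matrix by its mean absolute value, then round and clip every entry to {-1, 0, 1}. A minimal, dependency-free sketch (the function name and epsilon choice are illustrative, not from the paper):

```python
def absmean_ternarize(weights, eps=1e-5):
    """Quantize a list of weights to ternary values {-1, 0, 1}.

    Sketch of absmean quantization: divide by the per-tensor mean
    absolute value (gamma), round to the nearest integer, clip to
    [-1, 1]. Dequantized weights are w_ternary * gamma.
    """
    # per-tensor scale: mean absolute value (eps avoids division by zero)
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    # round to nearest integer, then clip into {-1, 0, 1}
    ternary = [max(-1, min(1, round(w / gamma))) for w in weights]
    return ternary, gamma

wq, gamma = absmean_ternarize([0.9, -0.05, 0.4, -1.2])
# small weights collapse to 0; large ones saturate to +/-1
```

Because every weight is one of three values, matrix multiplication reduces to additions and subtractions (no multiplications), which is the source of the latency and energy savings the abstract describes.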


Mixtures of Experts Unlock Parameter Scaling for Deep RL

https://huggingface.co/papers/2402.08609

https://arxiv.org/abs/2402.08609

https://github.com/google/dopamine

Google


The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

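The Soft MoE mechanism the paper builds on (Puigcerver et al., 2023) routes every token to every expert slot with continuous softmax weights instead of hard top-k assignments, which keeps the layer fully differentiable. A minimal, dependency-free sketch, assuming one slot per expert for brevity (names like `soft_moe` and `phi` are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def soft_moe(tokens, phi, experts):
    """Soft MoE layer sketch (one slot per expert).

    tokens:  n input vectors, each of length d
    phi:     one d-vector of routing parameters per slot
    experts: one callable per slot, mapping a vector to a vector
    """
    n, s, d = len(tokens), len(phi), len(tokens[0])
    # routing logits: logits[i][j] = <token_i, phi_j>
    logits = [[sum(t * p for t, p in zip(tok, ph)) for ph in phi]
              for tok in tokens]
    # dispatch: softmax over tokens per slot; each slot input is a
    # convex combination of all tokens
    slot_in = []
    for j in range(s):
        col = softmax([logits[i][j] for i in range(n)])
        slot_in.append([sum(col[i] * tokens[i][k] for i in range(n))
                        for k in range(d)])
    # each expert processes its own slot
    slot_out = [experts[j](slot_in[j]) for j in range(s)]
    # combine: softmax over slots per token; each output is a convex
    # combination of all slot outputs
    out = []
    for i in range(n):
        row = softmax(logits[i])
        out.append([sum(row[j] * slot_out[j][k] for j in range(s))
                    for k in range(len(slot_out[0]))])
    return out
```

In the paper's setting, a layer like this replaces the penultimate dense layer of a value-based network (e.g. in the Dopamine agents linked above), so adding experts grows the parameter count without the performance collapse seen with naively wider dense layers.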