The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

https://huggingface.co/papers/2402.17764

https://arxiv.org/abs/2402.17764



Recent research, such as BitNet, is paving the way for a new era of 1-bit Large Language Models (LLMs). In this work, we introduce a 1-bit LLM variant, namely BitNet b1.58, in which every single parameter (or weight) of the LLM is ternary {-1, 0, 1}. It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. Furthermore, it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.

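The ternary scheme can be illustrated with the absmean quantization described in the BitNet b1.58 paper: scale each weight matrix by its mean absolute value, then round and clip every entry to {-1, 0, 1}. A minimal, dependency-free sketch (the function name and epsilon choice are illustrative, not from the paper):

```python
def absmean_ternarize(weights, eps=1e-5):
    """Quantize a list of weights to ternary values {-1, 0, 1}.

    Sketch of absmean quantization: divide by the per-tensor mean
    absolute value (gamma), round to the nearest integer, clip to
    [-1, 1]. Dequantized weights are w_ternary * gamma.
    """
    # per-tensor scale: mean absolute value (eps avoids division by zero)
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    # round to nearest integer, then clip into {-1, 0, 1}
    ternary = [max(-1, min(1, round(w / gamma))) for w in weights]
    return ternary, gamma

wq, gamma = absmean_ternarize([0.9, -0.05, 0.4, -1.2])
# small weights collapse to 0; large ones saturate to +/-1
```

Because every weight is one of three values, matrix multiplication reduces to additions and subtractions (no multiplications), which is the source of the latency and energy savings the abstract describes.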


Mixtures of Experts Unlock Parameter Scaling for Deep RL

https://huggingface.co/papers/2402.08609

https://arxiv.org/abs/2402.08609

https://github.com/google/dopamine

Google


The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

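The Soft MoE mechanism the paper builds on (Puigcerver et al., 2023) routes every token to every expert slot with continuous softmax weights instead of hard top-k assignments, which keeps the layer fully differentiable. A minimal, dependency-free sketch, assuming one slot per expert for brevity (names like `soft_moe` and `phi` are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def soft_moe(tokens, phi, experts):
    """Soft MoE layer sketch (one slot per expert).

    tokens:  n input vectors, each of length d
    phi:     one d-vector of routing parameters per slot
    experts: one callable per slot, mapping a vector to a vector
    """
    n, s, d = len(tokens), len(phi), len(tokens[0])
    # routing logits: logits[i][j] = <token_i, phi_j>
    logits = [[sum(t * p for t, p in zip(tok, ph)) for ph in phi]
              for tok in tokens]
    # dispatch: softmax over tokens per slot; each slot input is a
    # convex combination of all tokens
    slot_in = []
    for j in range(s):
        col = softmax([logits[i][j] for i in range(n)])
        slot_in.append([sum(col[i] * tokens[i][k] for i in range(n))
                        for k in range(d)])
    # each expert processes its own slot
    slot_out = [experts[j](slot_in[j]) for j in range(s)]
    # combine: softmax over slots per token; each output is a convex
    # combination of all slot outputs
    out = []
    for i in range(n):
        row = softmax(logits[i])
        out.append([sum(row[j] * slot_out[j][k] for j in range(s))
                    for k in range(len(slot_out[0]))])
    return out
```

In the paper's setting, a layer like this replaces the penultimate dense layer of a value-based network (e.g. in the Dopamine agents linked above), so adding experts grows the parameter count without the performance collapse seen with naively wider dense layers.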