Jamba: A Hybrid Transformer-Mamba Language Model
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
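The interleaving described above can be sketched as a block-layout function. This is a toy illustration only: the 1-in-8 attention ratio and MoE-every-other-layer pattern are assumed defaults for the sketch, not necessarily the published configuration, and the function names are invented.

```python
# Toy sketch of a Jamba-style hybrid block: mostly Mamba layers with an
# occasional attention layer, and MoE replacing the dense MLP in some
# layers. Ratios here are illustrative assumptions, not the paper's spec.

def jamba_block_layout(n_layers=8, attn_every=8, moe_every=2):
    """Return a list of (mixer, mlp) descriptors for one hybrid block."""
    layout = []
    for i in range(n_layers):
        mixer = "attention" if (i + 1) % attn_every == 0 else "mamba"
        mlp = "moe" if (i + 1) % moe_every == 0 else "dense"
        layout.append((mixer, mlp))
    return layout

layout = jamba_block_layout()
```

Because most layers use Mamba's recurrent state rather than a KV cache, a layout like this keeps the memory footprint small at long context lengths while the sparse attention layers preserve Transformer-style retrieval ability.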
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks, yet implementing these methods across different models requires non-trivial effort. We present LLAMAFACTORY, a unified framework that integrates a suite of cutting-edge efficient training methods. Through its built-in web UI, LLAMABOARD, it lets users flexibly customize the fine-tuning of 100+ LLMs without writing code. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks.
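One of the efficient methods frameworks like this integrate is low-rank adaptation (LoRA). A minimal, dependency-free sketch of the core idea follows; the function names and dimensions are illustrative and this is not LlamaFactory's actual API.

```python
# Sketch of a LoRA-style update: instead of fine-tuning the full weight
# matrix W (d_out x d_in), train two small matrices B (d_out x r) and
# A (r x d_in), then use W + (alpha / r) * (B @ A) at inference time.
# Only r * (d_in + d_out) parameters are trained instead of d_in * d_out.

def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def lora_merge(W, A, B, alpha=16):
    r = len(A)                      # rank of the low-rank adapter
    scale = alpha / r
    delta = matmul(B, A)            # d_out x d_in low-rank update
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]
```

For a 4096x4096 projection with rank r=8, this trains roughly 65K parameters per matrix instead of ~16.8M, which is what makes fine-tuning many large models on modest hardware practical.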
AIOS: LLM Agent Operating System
LLM-based intelligent agents face significant deployment challenges, particularly related to resource management. Allowing unrestricted access to LLM or tool resources can lead to inefficient or even potentially harmful resource allocation and utilization for agents. Furthermore, the absence of proper scheduling and resource management mechanisms in current agent designs hinders concurrent processing and limits overall system efficiency. To address these challenges, this paper proposes the architecture of AIOS (LLM-based AI Agent Operating System) in the context of managing LLM-based agents. It introduces a novel architecture for serving LLM-based agents by isolating resources and LLM-specific services from agent applications into an AIOS kernel. This AIOS kernel provides fundamental services (e.g., scheduling, context management, memory management, storage management, access control) for runtime agents. To enhance usability, AIOS also includes the AIOS SDK, a comprehensive suite of APIs designed for utilizing the functionalities provided by the AIOS kernel. Experimental results demonstrate that using AIOS can achieve up to a 2.1x faster execution when serving agents built with various agent frameworks.
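The kernel-mediated scheduling described above can be sketched as a toy queue: agents submit LLM requests to a shared kernel instead of calling the model directly, and the kernel dispatches them in order. All class and method names here are illustrative assumptions, not the real AIOS SDK.

```python
from collections import deque

# Toy sketch of kernel-mediated scheduling for agent LLM calls. Agents
# enqueue requests; the kernel drains the queue FIFO, so no single agent
# can monopolize the shared LLM resource.

class AgentKernel:
    def __init__(self):
        self.queue = deque()

    def submit(self, agent_id, prompt):
        """An agent hands its LLM request to the kernel instead of the LLM."""
        self.queue.append((agent_id, prompt))

    def run(self, llm):
        """Drain the queue, dispatching each request to the shared LLM."""
        results = []
        while self.queue:
            agent_id, prompt = self.queue.popleft()
            results.append((agent_id, llm(prompt)))
        return results

kernel = AgentKernel()
kernel.submit("agent-a", "plan trip")
kernel.submit("agent-b", "summarize doc")
results = kernel.run(lambda p: p.upper())   # stand-in for a real LLM call
```

A real kernel would add priorities, preemption, and the context/memory/storage services the abstract lists; the point of the sketch is only the isolation of resource access behind a scheduling layer.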
RAFT: Adapting Language Model to Domain Specific RAG
Pretraining Large Language Models (LLMs) on large corpora of textual data is now a standard paradigm. When using these LLMs for many downstream applications, it is common to additionally incorporate new information into the pretrained model either through RAG-based prompting or fine-tuning. However, the best methodology to incorporate this information remains an open question. In this paper, we present Retrieval Augmented Fine Tuning (RAFT), a training recipe that improves the model's ability to answer questions in "open-book" in-domain settings. In training RAFT, given a question and a set of retrieved documents, we train the model to ignore those documents that don't help in answering the question, which we call distractor documents. RAFT accomplishes this by citing verbatim the right sequence from the relevant document to help answer the question. This, coupled with RAFT's chain-of-thought-style responses, helps improve the model's ability to reason. In domain-specific RAG, RAFT consistently improves the model's performance across the PubMed, HotpotQA, and Gorilla datasets, presenting a post-training recipe for adapting pre-trained LLMs to in-domain RAG.
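The training-example construction described above can be sketched as follows: each example pairs a question with its golden (answer-bearing) document plus k sampled distractors, shuffled so position carries no signal. Field names and the sampling scheme are illustrative assumptions, not the authors' released code.

```python
import random

# Sketch of RAFT-style training-example construction: a question, its
# golden document, and k distractor documents drawn from the rest of the
# corpus, shuffled together. The model is then trained to answer from the
# golden document while ignoring the distractors.

def make_raft_example(question, golden_doc, corpus, k=3, seed=0):
    rng = random.Random(seed)
    distractors = rng.sample([d for d in corpus if d != golden_doc], k)
    docs = distractors + [golden_doc]
    rng.shuffle(docs)
    return {"question": question, "documents": docs, "oracle": golden_doc}

corpus = [f"doc-{i}" for i in range(10)]
example = make_raft_example("Q?", "doc-4", corpus, k=3)
```

The target output for such an example would be a chain-of-thought answer that quotes verbatim from the oracle document, which is what teaches the model to distinguish relevant evidence from distractors.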