
QLoRA: Efficient Finetuning of Quantized LLMs

https://huggingface.co/papers/2305.14314

https://arxiv.org/abs/2305.14314

https://github.com/artidoro/qlora



We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while requiring only 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights; (b) Double Quantization, which reduces the average memory footprint by quantizing the quantization constants; and (c) Paged Optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small, high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations, showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks cannot be trusted to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
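The NF4 idea — place the 16 quantization levels at quantiles of a standard normal, since pretrained weights are approximately normally distributed — can be sketched in plain Python. The level construction below is a simplified symmetric approximation (the paper's actual NF4 code book is asymmetric and reserves an exact zero), and block-wise absmax scaling stands in for the full quantization pipeline:

```python
from statistics import NormalDist

def nf4_levels(offset=0.99):
    # 16 levels placed at evenly spaced quantiles of N(0, 1), then
    # normalized into [-1, 1]. `offset` clips the outermost quantiles so
    # inv_cdf stays finite. (Illustrative only: the real NF4 code book is
    # asymmetric and contains an exact zero.)
    nd = NormalDist()
    qs = [nd.inv_cdf(0.5 * (1 + (2 * i / 15 - 1) * offset)) for i in range(16)]
    m = max(abs(q) for q in qs)
    return [q / m for q in qs]

def quantize_block(weights):
    # Block-wise absmax quantization: rescale the block into [-1, 1], then
    # store each weight as the index (4 bits) of the nearest NF4 level.
    levels = nf4_levels()
    scale = max(abs(w) for w in weights) or 1.0
    codes = [min(range(16), key=lambda j: abs(w / scale - levels[j]))
             for w in weights]
    return codes, scale  # 4-bit codes plus one float scale per block

def dequantize_block(codes, scale):
    levels = nf4_levels()
    return [levels[c] * scale for c in codes]
```

Double Quantization would additionally quantize the per-block `scale` constants themselves, which is where the further reduction in average memory footprint comes from.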



Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold

https://huggingface.co/papers/2305.10973

https://arxiv.org/abs/2305.10973

https://github.com/XingangPan/DragGAN



Synthesizing visual content that meets users' needs often requires flexible and precise controllability of the pose, shape, expression, and layout of the generated objects. Existing approaches gain controllability of generative adversarial networks (GANs) via manually annotated training data or a prior 3D model, which often lack flexibility, precision, and generality. In this work, we study a powerful yet much less explored way of controlling GANs, that is, to "drag" any points of the image to precisely reach target points in a user-interactive manner, as shown in Fig.1. To achieve this, we propose DragGAN, which consists of two main components: 1) a feature-based motion supervision that drives the handle point to move towards the target position, and 2) a new point tracking approach that leverages the discriminative generator features to keep localizing the position of the handle points. Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc. As these manipulations are performed on the learned generative image manifold of a GAN, they tend to produce realistic outputs even for challenging scenarios such as hallucinating occluded content and deforming shapes that consistently follow the object's rigidity. Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking. We also showcase the manipulation of real images through GAN inversion.
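Point tracking in DragGAN is a nearest-neighbor search in the generator's feature space: after each motion-supervision step, the handle point is re-localized to the position in a small patch whose feature best matches the handle's feature from the initial frame. A minimal, framework-free sketch of that search (feature maps as nested Python lists; the real method operates on intermediate StyleGAN2 feature maps):

```python
def track_point(feat, ref_vec, p, radius=2):
    # Search a (2*radius+1)^2 patch around the current handle position p
    # for the location whose feature vector is closest (squared L2) to the
    # reference feature `ref_vec` recorded at the initial handle position.
    best, best_d = p, float("inf")
    H, W = len(feat), len(feat[0])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = p[0] + dy, p[1] + dx
            if 0 <= y < H and 0 <= x < W:
                d = sum((a - b) ** 2 for a, b in zip(feat[y][x], ref_vec))
                if d < best_d:
                    best, best_d = (y, x), d
    return best
```

Because the generator's features are discriminative, this simple local search is enough to keep the handle locked onto the same semantic point as the image deforms.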



RWKV: Reinventing RNNs for the Transformer Era

https://huggingface.co/papers/2305.13048

https://arxiv.org/abs/2305.13048

https://github.com/BlinkDL/RWKV-LM



Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs.


Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintaining constant computational and memory complexity during inference. We scale our models to as many as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work represents a significant step towards reconciling the trade-off between computational efficiency and model performance in sequence processing tasks.
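In RNN mode, the core of the linear attention reduces to a pair of running sums updated in O(1) time and memory per token, which is where the constant inference cost comes from. A heavily simplified scalar sketch of that recurrence (real RWKV uses learned per-channel decay and current-token bonus parameters, plus a numerically stabilized state; the decay `w` and bonus `u` below are illustrative stand-ins):

```python
import math

def wkv_recurrent(ks, vs, w=0.9, u=1.5):
    # Running exponentially weighted average of values v, keyed by exp(k).
    # The state (num, den) is constant-size regardless of sequence length;
    # the current token gets an extra bonus weight exp(u).
    num = den = 0.0
    out = []
    for k, v in zip(ks, vs):
        ek = math.exp(k)
        out.append((num + math.exp(u) * ek * v) / (den + math.exp(u) * ek))
        num = w * num + ek * v  # O(1) state update per token
        den = w * den + ek
    return out
```

The same quantity can also be written as a sum over all past positions, which is the parallel "Transformer mode" used during training; the recurrence above is the equivalent sequential form used at inference.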



Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Language models are increasingly being deployed for general problem solving across a wide range of tasks, but are still confined to token-level, left-to-right decision-making processes during inference. This means they can fall short in tasks that require exploration, strategic lookahead, or where initial decisions play a pivotal role. To surmount these challenges, we introduce a new framework for language model inference, "Tree of Thoughts" (ToT), which generalizes over the popular "Chain of Thought" approach to prompting language models, and enables exploration over coherent units of text ("thoughts") that serve as intermediate steps toward problem solving. ToT allows LMs to perform deliberate decision making by considering multiple different reasoning paths and self-evaluating choices to decide the next course of action, as well as looking ahead or backtracking when necessary to make global choices. Our experiments show that ToT significantly enhances language models' problem-solving abilities on three novel tasks requiring non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. For instance, in Game of 24, while GPT-4 with chain-of-thought prompting only solved 4% of tasks, our method achieved a success rate of 74%.
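The search loop underlying ToT can be made concrete as breadth-first search over partial solutions with a scored frontier. In the sketch below, `propose`, `evaluate`, and `is_solution` are placeholder callables standing in for what the paper implements as LM prompts (thought generation and state evaluation):

```python
def tot_bfs(root, propose, evaluate, is_solution, beam=3, depth=4):
    # Breadth-first Tree-of-Thoughts search: at each level, expand every
    # candidate partial solution ("thought") with propose(), return any
    # child that solves the task, and otherwise keep only the top `beam`
    # children by evaluate() score for the next level.
    frontier = [root]
    for _ in range(depth):
        children = [c for s in frontier for c in propose(s)]
        if not children:
            break
        for c in children:
            if is_solution(c):
                return c
        frontier = sorted(children, key=evaluate, reverse=True)[:beam]
    return None
```

For Game of 24, `propose` would ask the LM for candidate arithmetic steps on the remaining numbers and `evaluate` would ask it to judge whether 24 is still reachable; pruning to `beam` states per level is what bounds the cost of the lookahead.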
