
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

https://huggingface.co/papers/2309.12307

https://arxiv.org/abs/2309.12307

https://github.com/dvlab-research/LongLoRA



We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) at limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on a context length of 8192 incurs 16× the self-attention computation cost of a context length of 2048. In this paper, we speed up the context extension of LLMs in two respects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shifted sparse attention (S2-Attn) effectively enables context extension, yielding non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, and it is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well provided the embedding and normalization layers are trainable. LongLoRA combines this improved LoRA with S2-Attn and demonstrates strong empirical results on various tasks across Llama2 models from 7B/13B to 70B. LongLoRA extends Llama2 7B from a 4k context to 100k, and Llama2 70B to 32k, on a single 8× A100 machine. LongLoRA extends models' context while retaining their original architectures, and it is compatible with most existing techniques, such as Flash-Attention2. In addition, we further conduct supervised fine-tuning with LongLoRA on our long instruction-following LongAlpaca dataset. All our code, models, dataset, and demo are available at github.com/dvlab-research/LongLoRA.
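The core S2-Attn idea — local attention within fixed-size groups, with half the heads operating on a half-group-shifted sequence so information flows between neighbouring groups — can be sketched in a few lines. This is an illustrative NumPy toy (single sequence, q = k = v, no projections or causal mask; the function and variable names are ours, not from the released code):

```python
import numpy as np

def shifted_sparse_attention(x, n_heads, group_size):
    # x: (seq_len, n_heads, head_dim) features already split into heads.
    # Half the heads attend within contiguous groups; the other half work on
    # a sequence shifted by group_size // 2 (then shifted back), so tokens
    # near group boundaries can still exchange information.
    seq_len, _, head_dim = x.shape
    shift = group_size // 2
    out = np.empty_like(x)
    for h in range(n_heads):
        xh = x[:, h, :]
        if h >= n_heads // 2:                 # the shifted half of the heads
            xh = np.roll(xh, -shift, axis=0)
        yh = np.empty_like(xh)
        for start in range(0, seq_len, group_size):
            g = xh[start:start + group_size]              # (g, d) local group
            scores = g @ g.T / np.sqrt(head_dim)          # (g, g) attention
            w = np.exp(scores - scores.max(axis=-1, keepdims=True))
            w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
            yh[start:start + group_size] = w @ g
        if h >= n_heads // 2:
            yh = np.roll(yh, shift, axis=0)               # undo the shift
        out[:, h, :] = yh
    return out
```

The cost is linear in the number of groups rather than quadratic in the full sequence length, which is where the training-time saving comes from; at inference the model can fall back to dense attention.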



NExT-GPT: Any-to-Any Multimodal LLM

https://huggingface.co/papers/2309.05519

https://arxiv.org/abs/2309.05519

https://next-gpt.github.io/

https://github.com/NExT-GPT/NExT-GPT



While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly remain limited to input-side multimodal understanding, without the ability to produce content in multiple modalities. Since we humans always perceive the world and communicate with others through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill this gap, we present NExT-GPT, an end-to-end, general-purpose, any-to-any MM-LLM system. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, image, video, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small number of parameters (1%), those of certain projection layers, which not only lowers training cost but also facilitates convenient expansion to further modalities. Moreover, we introduce modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building a unified AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community.
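The "1% of parameters" are the lightweight projection layers that sit between frozen modality encoders/decoders and the frozen LLM. A hypothetical NumPy sketch of such a projector (class name, pooling scheme, and shapes are ours for illustration; the released system uses learned grouping rather than this simple subsampling):

```python
import numpy as np

class ModalityProjector:
    """Toy stand-in for a NExT-GPT-style projection layer: only this small
    module would be trained, while the encoder and LLM stay frozen."""

    def __init__(self, enc_dim, llm_dim, n_query_tokens, seed=0):
        rng = np.random.default_rng(seed)
        # the only trainable weights: enc_dim -> llm_dim linear map
        self.w = rng.normal(scale=enc_dim ** -0.5, size=(enc_dim, llm_dim))
        self.n_query_tokens = n_query_tokens

    def __call__(self, enc_feats):
        # enc_feats: (n_patches, enc_dim) from a frozen modality encoder.
        # Subsample to a fixed number of "tokens", then project to LLM width
        # so the result can be spliced into the LLM's input sequence.
        idx = np.linspace(0, len(enc_feats) - 1, self.n_query_tokens).astype(int)
        return enc_feats[idx] @ self.w        # (n_query_tokens, llm_dim)
```

Because only `self.w` carries gradients, adding a new modality amounts to training one more small projector against the same frozen LLM.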



An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

https://huggingface.co/papers/2309.09958

https://arxiv.org/abs/2309.09958

https://github.com/haotian-liu/LLaVA



Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs use models with 13B parameters or fewer. In this paper, we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share findings from our explorations of image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that LoRA/QLoRA tuning of LMMs performs comparably to full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and of mixing in multimodal-language data for improving LMM performance, and shows that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope this study makes state-of-the-art LMM research at larger scales more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
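LoRA, which this study finds competitive with full fine-tuning at LMM scale, replaces each frozen weight update with a trainable low-rank product. A minimal NumPy sketch of the forward pass (names are ours; this is the generic LoRA formulation, not the PEFT library's implementation):

```python
import numpy as np

def lora_forward(x, w_frozen, a, b, alpha=16.0):
    # LoRA: y = x W + (alpha / r) * x A B, where W (d_in, d_out) is frozen
    # and only A (d_in, r) and B (r, d_out) are trained, with rank r small.
    # Standard LoRA initializes B to zeros, so training starts from the
    # frozen model's behaviour exactly.
    r = a.shape[1]
    return x @ w_frozen + (alpha / r) * (x @ a) @ b
```

With r much smaller than d_in and d_out, the trainable parameter count per layer drops from d_in·d_out to r·(d_in + d_out), which is what makes 33B–70B fine-tuning tractable; QLoRA additionally stores `w_frozen` in 4-bit quantized form.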



ImageBind-LLM: Multi-modality Instruction Tuning

https://huggingface.co/papers/2309.03905

https://arxiv.org/abs/2309.03905

https://github.com/opengvlab/llama-adapter



We present ImageBind-LLM, a multi-modality instruction tuning method for large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning; in contrast, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic, with only image-text alignment training. During training, we adopt a learnable bind network to align the embedding spaces of LLaMA and ImageBind's image encoder. The image features transformed by the bind network are then added to the word tokens at all layers of LLaMA, progressively injecting visual instructions via an attention-free, zero-initialized gating mechanism. Aided by the joint embedding of ImageBind, this simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, multi-modality inputs are fed into the corresponding ImageBind encoders and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions in diverse modalities and exhibits strong language generation quality.
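The two mechanisms above — zero-initialized gated injection at training time and training-free cache retrieval at inference — can be sketched as follows. A hypothetical NumPy toy (function names, the 50/50 mixing weight, and cosine-similarity retrieval are our illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def gated_inject(word_tokens, visual_feats, gate):
    # Attention-free, zero-initialized gating: projected visual features are
    # simply added to every word token, scaled by a learned scalar gate that
    # starts at 0 so the pre-trained LLM is initially undisturbed.
    return word_tokens + gate * visual_feats

def cache_enhance(query, cache, top_k=3):
    # Training-free visual cache: retrieve the top-k cached image features
    # most similar (cosine) to the incoming embedding and blend them with
    # it, pulling non-image embeddings toward the image distribution seen
    # in training and narrowing the train/inference modality gap.
    sims = cache @ query / (
        np.linalg.norm(cache, axis=1) * np.linalg.norm(query) + 1e-8
    )
    top = cache[np.argsort(sims)[-top_k:]]        # (top_k, dim)
    return 0.5 * query + 0.5 * top.mean(axis=0)   # (dim,)
```

In the real system the cache holds three million ImageBind image features; the toy above works the same way on any feature matrix.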
