
LLM in a flash: Efficient Large Language Model Inference with Limited Memory

https://huggingface.co/papers/2312.11514

https://arxiv.org/abs/2312.11514


闪存中的大语言模型:内存受限下的高效推理

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices with limited DRAM capacity. This paper tackles the challenge of efficiently running LLMs that exceed the available DRAM capacity by storing the model parameters in flash memory, but bringing them on demand to DRAM. Our method involves constructing an inference cost model that takes into account the characteristics of flash memory, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Within this hardware-informed framework, we introduce two principal techniques. First, "windowing" strategically reduces data transfer by reusing previously activated neurons, and second, "row-column bundling", tailored to the sequential data access strengths of flash memory, increases the size of data chunks read from flash memory. These methods collectively enable running models up to twice the size of the available DRAM, with up to 4x and 20x increase in inference speed compared to naive loading approaches in CPU and GPU, respectively. Our integration of sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for effective inference of LLMs on devices with limited memory.

大语言模型是现代自然语言处理的核心,在各种任务中表现出卓越的性能。然而,它们巨大的计算和内存需求带来了挑战,特别是对于 DRAM 容量有限的设备。本文通过将模型参数存储在闪存中,并按需加载到 DRAM,解决了高效运行超出可用 DRAM 容量的大语言模型的难题。我们的方法涉及构建一个考虑闪存特性的推理成本模型,指导我们在两个关键领域进行优化:减少从闪存传输的数据量,以及以更大、更连续的块读取数据。在此硬件感知框架内,我们引入了两种主要技术。首先,“窗口化”通过重用先前激活的神经元来策略性地减少数据传输;其次,“行列捆绑”针对闪存顺序数据访问的优势,增加了从闪存读取的数据块大小。这些方法共同使得能够运行的模型大小可达可用 DRAM 的两倍,与 CPU 和 GPU 上的朴素加载方法相比,推理速度分别提升高达 4 倍和 20 倍。我们对稀疏性感知、上下文自适应加载和硬件导向设计的整合,为在内存受限的设备上有效推理大语言模型铺平了道路。
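The "windowing" idea described above can be sketched in a few lines of Python: keep in DRAM the union of neurons activated over the last k tokens, and on each new token transfer from flash only the neurons that are newly needed, evicting those that fell out of the window. This is an illustrative toy, not the paper's implementation; the class and names below are made up for the sketch.

```python
from collections import deque

class NeuronWindow:
    """Sketch of the 'windowing' strategy: DRAM holds the union of neurons
    activated in the last `window_size` tokens; each step only the difference
    is transferred from flash."""

    def __init__(self, window_size):
        self.history = deque(maxlen=window_size)  # per-token activated sets
        self.resident = set()                     # neurons currently in DRAM

    def step(self, activated):
        """`activated` = set of neuron ids a predictor says fire for this token.
        Returns (loads, evictions): ids read from flash / dropped from DRAM."""
        self.history.append(set(activated))
        needed = set().union(*self.history)       # union over the window
        loads = needed - self.resident            # only the delta hits flash
        evictions = self.resident - needed
        self.resident = needed
        return loads, evictions

# Toy trace: consecutive tokens reuse most neurons, so flash traffic stays small.
w = NeuronWindow(window_size=2)
print(w.step({1, 2, 3}))  # first token: everything must be loaded
print(w.step({2, 3, 4}))  # only neuron 4 is new
print(w.step({3, 4, 5}))  # neuron 5 loaded; neuron 1 left the window and is evicted
```

Because neuron activations overlap heavily between neighboring tokens, the per-token delta is far smaller than the full activated set, which is what reduces flash traffic.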


Mamba: Linear-Time Sequence Modeling with Selective State Spaces

https://huggingface.co/papers/2312.00752

https://arxiv.org/abs/2312.00752

https://github.com/state-spaces/mamba


Mamba:基于选择性状态空间的线性时间序列建模

Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5× higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.

基础模型如今驱动着深度学习中最激动人心的大部分应用,它们几乎都基于Transformer架构及其核心的注意力模块。为了解决Transformer在长序列上的计算低效问题,人们开发了许多次二次时间复杂度的架构,如线性注意力、门控卷积和循环模型,以及结构化状态空间模型(SSM),但它们在语言等重要模态上的表现不如注意力机制。我们发现这类模型的一个关键弱点在于无法执行基于内容的推理,并为此进行了若干改进。首先,我们让SSM的参数成为输入的函数,这解决了它们在离散模态上的弱点,使得模型能够根据当前词元,沿序列长度维度选择性地传播或遗忘信息。其次,尽管这一改变使得无法使用高效卷积,但我们设计了一种在循环模式下运行的、硬件感知的并行算法。我们将这些选择性SSM集成到一个简化的、无注意力甚至无MLP块的端到端神经网络架构(Mamba)中。Mamba享有快速推理的优势,其吞吐量是Transformer的5倍,且在序列长度上呈线性扩展,在处理长达百万长度的真实数据序列时性能依然持续提升。作为通用的序列模型骨干,Mamba在语言、音频和基因组学等多种模态上均取得了最先进的性能。在语言建模方面,我们的Mamba-3B模型在预训练和下游评估中都优于同规模的Transformer,并与两倍规模的Transformer性能相当。
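The selectivity mechanism can be illustrated with a toy, single-channel state-space recurrence in which the discretization step and the input/output projections depend on the current input. This is not Mamba's actual parameterization (which is multi-channel, with learned projections and a hardware-aware parallel scan); it only shows how input-dependent parameters let the recurrence decide, per token, what to keep and what to forget.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_dt):
    """Toy scalar selective SSM scan. The key idea from the paper: B, C, and
    the step size dt are *functions of the input*, so propagation/forgetting
    along the sequence is content-dependent. Runs in linear time in T."""
    T = x.shape[0]
    h = 0.0
    y = np.empty(T)
    for t in range(T):
        dt = np.log1p(np.exp(W_dt * x[t]))   # softplus: input-dependent step size
        A_bar = np.exp(dt * A)               # discretized decay, in (0, 1) for A < 0
        B_bar = dt * (W_B * x[t])            # input-dependent input gate
        h = A_bar * h + B_bar * x[t]         # recurrence: selectively keep/forget
        y[t] = (W_C * x[t]) * h              # input-dependent readout
    return y

x = np.array([1.0, 0.0, 0.0, 1.0])
print(selective_ssm(x, A=-1.0, W_B=1.0, W_C=1.0, W_dt=1.0))
```

Note how zero inputs contribute nothing to the state (B_bar vanishes) yet the state persists with partial decay, so information from the first token is still available at the last one.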


StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

https://huggingface.co/papers/2312.12491

https://arxiv.org/abs/2312.12491

https://github.com/cumulo-autumn/StreamDiffusion


StreamDiffusion:面向实时交互生成的管道级解决方案

We introduce StreamDiffusion, a real-time diffusion pipeline designed for streaming image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as augmented/virtual reality, video game graphics rendering, live video streaming, and broadcasting, where high throughput is imperative. StreamDiffusion tackles this challenge through a novel pipeline-level system design. It employs unique strategies like batching the denoising process (Stream Batch), residual classifier-free guidance (R-CFG), and stochastic similarity filtering (SSF). Additionally, it seamlessly integrates advanced acceleration technologies for maximum efficiency. Specifically, Stream Batch reformulates the denoising process by eliminating the traditional wait-and-execute approach and utilizing a batching denoising approach, facilitating fluid and high-throughput streams. This results in 1.5x higher throughput compared to the conventional sequential denoising approach. R-CFG significantly addresses inefficiencies caused by repetitive computations during denoising. It optimizes the process to require minimal or no additional computations, leading to speed improvements of up to 2.05x compared to previous classifier-free methods. Besides, our stochastic similarity filtering dramatically lowers GPU activation frequency by halting computations for static image flows, achieving a remarkable reduction in computational consumption—2.39 times on an RTX 3060 GPU and 1.99 times on an RTX 4090 GPU, respectively. The synergy of our proposed strategies with established acceleration technologies enables image generation to reach speeds of up to 91.07 fps on a single RTX 4090 GPU, outperforming the throughput of AutoPipeline, developed by Diffusers, by more than 59.56x.

我们推出了 StreamDiffusion,一个专为流式图像生成设计的实时扩散管道。现有的扩散模型擅长从文本或图像提示生成图像,但它们在实时交互方面常常力不从心。这一限制在涉及连续输入的场景中尤为明显,例如增强/虚拟现实、视频游戏图形渲染、实时视频流和广播,这些场景对高吞吐量有严格要求。StreamDiffusion 通过一种新颖的管道级系统设计来应对这一挑战。它采用了独特的策略,如对去噪过程进行批处理(Stream Batch)、残差无分类器引导(R-CFG)和随机相似性过滤(SSF)。此外,它无缝集成了先进的加速技术以实现最大效率。具体来说,Stream Batch 通过消除传统的等待执行方法,并采用批处理去噪方法,重构了去噪过程,促进了流畅且高吞吐量的流式处理。与传统的顺序去噪方法相比,这带来了 1.5 倍的吞吐量提升。R-CFG 显著解决了去噪过程中重复计算导致的低效问题。它优化了该过程,使其只需极少或无需额外计算,与之前的无分类器方法相比,速度提升高达 2.05 倍。此外,我们的随机相似性过滤通过停止对静态图像流的计算,显著降低了 GPU 的激活频率,在 RTX 3060 GPU 和 RTX 4090 GPU 上分别将计算消耗减少了 2.39 倍和 1.99 倍。我们提出的策略与现有加速技术的协同作用,使得在单个 RTX 4090 GPU 上的图像生成速度高达 91.07 fps,比 Diffusers 开发的 AutoPipeline 的吞吐量高出超过 59.56 倍。
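The Stream Batch idea can be shown with a small scheduling sketch: instead of running all denoising steps of one frame before starting the next, each model call processes a batch of in-flight frames, each at a different denoising step. Here `denoise_batch` stands in for one batched U-Net call; all names are hypothetical, and the real system batches latents on the GPU rather than Python tuples.

```python
from collections import deque

def stream_batch(frames, num_steps, denoise_batch):
    """Sketch of Stream Batch pipelining: every call to `denoise_batch`
    advances *all* in-flight frames by one denoising step, so throughput
    approaches one frame per model call instead of one per `num_steps` calls."""
    pipeline = deque()          # (frame, steps_done) pairs in flight
    outputs = []
    frames = iter(frames)
    while True:
        nxt = next(frames, None)
        if nxt is not None:
            pipeline.append((nxt, 0))   # admit a new frame each cycle
        if not pipeline:
            break
        advanced = denoise_batch(list(pipeline))      # one batched model call
        pipeline = deque((f, s + 1) for f, s in advanced)
        while pipeline and pipeline[0][1] >= num_steps:
            outputs.append(pipeline.popleft()[0])     # fully denoised: emit
    return outputs

calls = []
def fake_denoiser(items):
    calls.append(len(items))    # record the batch size of each model call
    return items

out = stream_batch(["f0", "f1", "f2"], num_steps=3, denoise_batch=fake_denoiser)
print(out)    # frames complete in arrival order: ['f0', 'f1', 'f2']
print(calls)  # batch fills as the pipeline warms up, then drains: [1, 2, 3, 2, 1]
```

With a steady input stream the pipeline stays full, so the batch size sits at `num_steps` and each model call retires one finished frame, which is the "fluid, high-throughput stream" the abstract describes.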


PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

https://huggingface.co/papers/2312.04461

https://arxiv.org/abs/2312.04461

https://github.com/TencentARC/PhotoMaker


PhotoMaker:通过堆叠ID嵌入定制逼真人物照片

Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However, existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized text-to-image generation method, which mainly encodes an arbitrary number of input ID images into a stacked ID embedding for preserving ID information. Such an embedding, serving as a unified ID representation, can not only encapsulate the characteristics of the same input ID comprehensively, but also accommodate the characteristics of different IDs for subsequent integration. This paves the way for more intriguing and practically valuable applications. Besides, to drive the training of our PhotoMaker, we propose an ID-oriented data construction pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline, our PhotoMaker demonstrates better ID preservation ability than test-time fine-tuning based methods, yet provides significant speed improvements, high-quality generation results, strong generalization capabilities, and a wide range of applications.

近期文本到图像生成的进展在根据给定文本提示合成逼真人物照片方面取得了显著进步。然而,现有的个性化生成方法无法同时满足高效率、高身份(ID)保真度和灵活文本可控性的要求。在这项工作中,我们引入了 PhotoMaker,一种高效的个性化文本到图像生成方法,它主要将任意数量的输入 ID 图像编码为一个堆叠 ID 嵌入以保留 ID 信息。这种嵌入作为一个统一的 ID 表示,不仅能全面封装同一输入 ID 的特征,还能容纳不同 ID 的特征以供后续整合。这为更有趣、更具实用价值的应用铺平了道路。此外,为了推动 PhotoMaker 的训练,我们提出了一个面向 ID 的数据构建流程来组装训练数据。在通过该流程构建的数据集的支持下,我们的 PhotoMaker 比基于测试时微调的方法展现出更好的 ID 保留能力,同时提供了显著的速度提升、高质量的生成结果、强大的泛化能力和广泛的应用。
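A minimal sketch of the stacking step: encode each of an arbitrary number of ID images, stack the per-image embeddings, and fuse them into one unified ID representation. The random projection stands in for PhotoMaker's image encoder and mean pooling stands in for its learned fusion; every name and shape here is illustrative, not the paper's architecture.

```python
import numpy as np

def stacked_id_embedding(id_images, encode, fuse):
    """Sketch of the stacked-ID-embedding idea: any number N of ID images of
    the same person are encoded, stacked along a new axis, and fused into a
    unified ID representation that conditions the text-to-image model."""
    embs = np.stack([encode(img) for img in id_images])  # (N, D): works for any N
    return fuse(embs)                                    # (D,): unified ID token

# Toy stand-ins for the encoder and the fusion module.
rng = np.random.default_rng(0)
proj = rng.normal(size=(8, 16))              # hypothetical 8-dim image -> 16-dim embedding
encode = lambda img: img @ proj
fuse = lambda embs: embs.mean(axis=0)        # mean pooling as a placeholder fusion

imgs = [rng.normal(size=8) for _ in range(4)]   # four ID images; any count works
unified = stacked_id_embedding(imgs, encode, fuse)
print(unified.shape)  # (16,)
```

The point of the stacking is that the same code path handles one reference photo or many, which is what lets the method trade extra references for better ID fidelity without test-time fine-tuning.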


Amphion: An Open-Source Audio, Music and Speech Generation Toolkit

https://huggingface.co/papers/2312.09911

https://arxiv.org/abs/2312.09911

https://github.com/open-mmlab/amphion


Amphion:一个开源的音频、音乐和语音生成工具包

Amphion is an open-source toolkit for Audio, Music, and Speech Generation, aiming to ease the entry of junior researchers and engineers into these fields. It presents a unified framework that covers diverse generation tasks and models and is easily extendable to incorporate new ones. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. The initial release of Amphion v0.1 supports a range of tasks including Text to Speech (TTS), Text to Audio (TTA), and Singing Voice Conversion (SVC), supplemented by essential components like data preprocessing, state-of-the-art vocoders, and evaluation metrics. This paper presents a high-level overview of Amphion.

Amphion 是一个面向音频、音乐和语音生成的开源工具包,旨在为初级研究人员和工程师进入这些领域提供便利。它提供了一个统一的框架,包含多样化的生成任务和模型,并且易于扩展以集成新功能。该工具包设计了初学者友好的工作流程和预训练模型,使初学者和经验丰富的研究者都能相对轻松地启动项目。Amphion v0.1 的初始版本支持一系列任务,包括文本到语音、文本到音频和歌声转换,并辅以数据预处理、最先进的声码器和评估指标等核心组件。本文对 Amphion 进行了高层次概述。


PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU

https://huggingface.co/papers/2312.12456

https://arxiv.org/abs/2312.12456

https://github.com/SJTU-IPADS/PowerInfer


PowerInfer:基于消费级GPU的快速大语言模型服务

This paper introduces PowerInfer, a high-speed Large Language Model (LLM) inference engine on a personal computer (PC) equipped with a single consumer-grade GPU. The key principle underlying the design of PowerInfer is exploiting the high locality inherent in LLM inference, characterized by a power-law distribution in neuron activation. This distribution indicates that a small subset of neurons, termed hot neurons, are consistently activated across inputs, while the majority, cold neurons, vary based on specific inputs. PowerInfer exploits such an insight to design a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers. PowerInfer further integrates adaptive predictors and neuron-aware sparse operators, optimizing the efficiency of neuron activation and computational sparsity. The evaluation shows that PowerInfer significantly outperforms llama.cpp by up to 11.69× while retaining model accuracy across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU. For the OPT-30B model, PowerInfer achieves performance comparable to that of a high-end server-grade A100 GPU, reaching 82% of its token generation rate on a single consumer-grade RTX 4090 GPU.

本文介绍了 PowerInfer,一个在配备单块消费级 GPU 的个人电脑上的高速大语言模型推理引擎。PowerInfer 设计的关键原理在于利用大语言模型推理固有的高度局部性,其特征是神经元激活的幂律分布。该分布表明,一小部分神经元,称为热神经元,在各种输入中持续被激活,而大多数冷神经元则根据特定输入而变化。PowerInfer 利用这一洞察,设计了一个 GPU-CPU 混合推理引擎:热激活神经元被预加载到 GPU 上以实现快速访问,而冷激活神经元则在 CPU 上计算,从而显著降低 GPU 内存需求和 CPU-GPU 数据传输。PowerInfer 进一步集成了自适应预测器和神经元感知的稀疏算子,优化了神经元激活和计算稀疏性的效率。评估表明,在单块 NVIDIA RTX 4090 GPU 上,PowerInfer 在各种大语言模型(包括 OPT-175B)上显著优于 llama.cpp,最高提升达 11.69 倍,同时保持模型精度。对于 OPT-30B 模型,PowerInfer 的性能可与高端服务器级 A100 GPU 相媲美,在单块消费级 RTX 4090 GPU 上达到了其词元生成速率的 82%。
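The hot/cold split can be sketched as a simple placement policy: given power-law activation statistics, preload the most frequently firing neurons onto the GPU up to its memory budget and route the rest to the CPU at runtime. This is an illustrative toy; the real system also uses adaptive activation predictors and solves a layer-wise placement problem.

```python
import numpy as np

def plan_placement(activation_counts, gpu_budget):
    """Sketch of PowerInfer's offline placement: rank neurons by how often
    they fired during profiling and preload the top `gpu_budget` ('hot')
    neurons onto the GPU; the long tail ('cold') stays on the CPU."""
    order = np.argsort(activation_counts)[::-1]   # most-activated first
    hot = set(order[:gpu_budget].tolist())
    cold = set(order[gpu_budget:].tolist())
    return hot, cold

def run_token(activated, hot):
    """Route one token's activated neurons: hot ones are computed on the GPU,
    cold ones on the CPU, so GPU memory and PCIe transfers stay small."""
    on_gpu = [n for n in activated if n in hot]
    on_cpu = [n for n in activated if n not in hot]
    return on_gpu, on_cpu

# Power-law-like profile: a few neurons fire constantly, the tail rarely.
counts = np.array([1000, 800, 50, 10, 5, 2, 1, 1])
hot, cold = plan_placement(counts, gpu_budget=2)
print(sorted(hot))                # [0, 1]
print(run_token([0, 1, 4], hot))  # ([0, 1], [4])
```

Because the distribution is power-law, a small GPU budget covers most activations in practice, which is why the hybrid engine loses little speed despite holding only a fraction of the model on the GPU.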


StarVector: Generating Scalable Vector Graphics Code from Images

https://huggingface.co/papers/2312.11556

https://arxiv.org/abs/2312.11556

https://github.com/joanrod/star-vector


StarVector:从图像生成可缩放矢量图形代码

Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation methods have focused on curve-based vectorization, lacking semantic understanding, often producing artifacts, and struggling with SVG primitives beyond path curves. To address these issues, we introduce StarVector, a multimodal large language model for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. Unlike traditional methods, StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. To train StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks and precise use of primitives like ellipses, polygons, and text. We address challenges in SVG evaluation, showing that pixel-based metrics like MSE fail to capture the unique qualities of vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3 tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this setup, StarVector achieves state-of-the-art performance, producing more compact and semantically rich SVGs.

可缩放矢量图形(SVG)因其可扩展性和多功能性,对于现代图像渲染至关重要。以往的 SVG 生成方法侧重于基于曲线的矢量化,缺乏语义理解,常常产生伪影,并且在处理路径曲线以外的 SVG 图元时表现不佳。为解决这些问题,我们引入了 StarVector,一个用于 SVG 生成的多模态大语言模型。它通过理解图像语义并运用 SVG 图元来执行图像矢量化,生成紧凑而精确的输出。与传统方法不同,StarVector 直接在 SVG 代码空间中工作,利用视觉理解来应用准确的 SVG 图元。为训练 StarVector,我们创建了 SVG-Stack,一个包含 200 万个样本的多样化数据集,使其能够泛化到各种矢量化任务,并精确使用椭圆、多边形和文本等图元。我们解决了 SVG 评估中的挑战,指出 MSE 等基于像素的指标无法捕捉矢量图形的独特特性。我们推出了 SVG-Bench,一个涵盖 10 个数据集和 3 项任务(图像到 SVG、文本到 SVG 生成和图表生成)的基准。在此设置下,StarVector 实现了最先进的性能,生成的 SVG 更紧凑且语义更丰富。
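The claim that pixel metrics like MSE miss what matters in vector graphics can be made concrete with a toy rasterization experiment (illustrative only, not SVG-Bench's actual protocol): a pixel-perfect shape shifted by one pixel can score worse under MSE than a shape whose geometry is genuinely wrong.

```python
import numpy as np

def mse(a, b):
    """Plain pixel-space mean squared error between two rasters."""
    return float(((a - b) ** 2).mean())

def raster_square(size, top, left, side):
    """Tiny stand-in rasterizer: a filled axis-aligned square on a binary canvas."""
    img = np.zeros((size, size))
    img[top:top + side, left:left + side] = 1.0
    return img

target   = raster_square(32, top=8, left=8, side=12)
shifted  = raster_square(32, top=8, left=9, side=12)   # identical shape, 1px right
degraded = raster_square(32, top=8, left=8, side=11)   # genuinely wrong size

# The faithful-but-shifted square differs on 24 pixels, the wrong-size square
# on only 23, so MSE ranks the structurally worse output as the better one.
print(mse(target, shifted), mse(target, degraded))
```

A metric operating on the SVG structure (or on perceptual similarity) would rank these the other way, which is the motivation the abstract gives for building SVG-Bench rather than relying on MSE alone.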