3D Gaussian Splatting for Real-Time Radiance Field Rendering
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and, importantly, allow high-quality real-time (≥ 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space. Second, we perform interleaved optimization and density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene. Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting, which both accelerates training and enables real-time rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.
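The anisotropic covariance mentioned in the abstract is typically parameterized by per-axis scales and a rotation, so it stays positive semi-definite during optimization; at render time each Gaussian contributes a falloff weight to the pixels it covers. A minimal 2D sketch of that parameterization and weight (the paper uses 3D Gaussians projected to 2D; the function names here are illustrative, not the paper's code):

```python
import math

def covariance_2d(sx, sy, theta):
    """Anisotropic covariance Sigma = R S (R S)^T from per-axis
    scales (sx, sy) and rotation angle theta, with S = diag(sx, sy)."""
    c, s = math.cos(theta), math.sin(theta)
    # Entries of M = R S
    a, b = c * sx, -s * sy
    d, e = s * sx,  c * sy
    # Sigma = M M^T (symmetric 2x2)
    return [[a * a + b * b, a * d + b * e],
            [a * d + b * e, d * d + e * e]]

def splat_weight(mean, cov, point):
    """Gaussian falloff exp(-0.5 * d^T Sigma^{-1} d), the weight with
    which a splatted Gaussian contributes at a given 2D point."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    (va, vb), (_, vd) = cov
    det = va * vd - vb * vb
    # Inverse of a symmetric 2x2 matrix
    ia, ib, id_ = vd / det, -vb / det, va / det
    m = dx * (ia * dx + ib * dy) + dy * (ib * dx + id_ * dy)
    return math.exp(-0.5 * m)
```

With `sx = sy` the Gaussian is isotropic; unequal scales plus a rotation give the elongated, oriented splats that let the representation fit thin structures accurately.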
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
ChatGPT-like models have revolutionized various applications in artificial intelligence, from summarization and coding to translation, matching or even surpassing human performance. However, the current landscape lacks an accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement Learning with Human Feedback) training pipeline for these powerful models, particularly when training at the scale of billions of parameters. This paper introduces DeepSpeed-Chat, a novel system that democratizes RLHF training, making it accessible to the AI community. DeepSpeed-Chat offers three key capabilities: an easy-to-use training and inference experience for ChatGPT-like models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from InstructGPT, and a robust DeepSpeed-RLHF system that combines various optimizations for training and inference in a unified way. The system delivers unparalleled efficiency and scalability, enabling training of models with hundreds of billions of parameters in record time and at a fraction of the cost. With this development, DeepSpeed-Chat paves the way for broader access to advanced RLHF training, even for data scientists with limited resources, thereby fostering innovation and further development in the field of AI.
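The InstructGPT pipeline that DeepSpeed-RLHF replicates ends with a PPO stage whose per-token reward typically combines a reward-model score for the full response with a KL penalty that keeps the policy close to the supervised fine-tuned reference. A minimal sketch of that reward shaping, assuming per-token log-probabilities are available (the function name and signature are illustrative, not DeepSpeed-Chat's API):

```python
def rlhf_token_rewards(logprobs_policy, logprobs_ref, rm_score, beta=0.1):
    """Per-token rewards for the PPO stage of an InstructGPT-style
    RLHF pipeline.

    Each generated token is penalized by beta * (log pi - log pi_ref),
    discouraging drift from the reference policy; the reward-model
    score for the whole response is added at the final token.
    """
    rewards = [-beta * (lp - lr)
               for lp, lr in zip(logprobs_policy, logprobs_ref)]
    rewards[-1] += rm_score
    return rewards
```

The KL coefficient `beta` trades off reward maximization against staying faithful to the reference model; systems like DeepSpeed-Chat wrap this objective in distributed actor/critic training to reach the scales the abstract describes.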
Nougat: Neural Optical Understanding for Academic Documents
Meta
Scientific knowledge is predominantly stored in books and scientific journals, often in the form of PDFs. However, the PDF format leads to a loss of semantic information, particularly for mathematical expressions. We propose Nougat (Neural Optical Understanding for Academic Documents), a Visual Transformer model that performs an Optical Character Recognition (OCR) task for processing scientific documents into a markup language, and demonstrate the effectiveness of our model on a new dataset of scientific documents. The proposed approach offers a promising solution to enhance the accessibility of scientific knowledge in the digital age, by bridging the gap between human-readable documents and machine-readable text. We release the models and code to accelerate future work on scientific text recognition.
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
Tencent
Recent years have witnessed the strong power of large text-to-image diffusion models, whose impressive generative capability creates high-fidelity images. However, it is very tricky to generate desired images using only a text prompt, as this often involves complex prompt engineering. An alternative to the text prompt is the image prompt, as the saying goes: "an image is worth a thousand words". Although existing methods of direct fine-tuning from pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompts, and structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter that achieves image prompt capability for pretrained text-to-image diffusion models. The key design of our IP-Adapter is a decoupled cross-attention mechanism that separates the cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve comparable or even better performance than a fully fine-tuned image prompt model. As we freeze the pretrained diffusion model, the proposed IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation using existing controllable tools. With the benefit of the decoupled cross-attention strategy, the image prompt also works well with the text prompt to achieve multimodal image generation.
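The decoupled cross-attention described in the abstract keeps the frozen model's text cross-attention unchanged and adds a separate attention over image features, with the two outputs summed (an image-prompt scale controls the image branch's contribution). A minimal single-query sketch of that combination (plain-Python attention for clarity; function names are illustrative, not the released code's API):

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

def decoupled_cross_attention(query, text_kv, image_kv, scale=1.0):
    """IP-Adapter-style decoupled cross-attention: the text branch
    reuses the frozen cross-attention, while a separate attention over
    image features (the adapter's new key/value projections) is added
    on top, weighted by an image-prompt scale."""
    text_out = attention(query, *text_kv)
    image_out = attention(query, *image_kv)
    return [t + scale * i for t, i in zip(text_out, image_out)]
```

Because only the image branch's key/value projections are trained, the base model stays frozen, which is what lets the adapter transfer to other fine-tuned checkpoints and combine freely with text prompts.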