LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Peking University, Tsinghua University
Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, which integrates samples from various visual question answering sources and provides structured reasoning annotations. In addition, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
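The four sequential stages can be made explicit by having the model emit each stage inside dedicated tags and parsing them out downstream. The sketch below is a minimal illustration of that structured-output format; the exact tag names (`<SUMMARY>`, `<CAPTION>`, `<REASONING>`, `<CONCLUSION>`) and the sample response are assumptions for illustration, not quoted from the paper.

```python
# Minimal sketch: parsing a stage-tagged response into its four reasoning
# stages. Tag names mirror the four stages named in the abstract
# (summarization, visual interpretation, logical reasoning, conclusion).
import re

STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_stages(text: str) -> dict:
    """Extract the content of each tagged stage from a model response."""
    out = {}
    for stage in STAGES:
        m = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.S)
        if m:
            out[stage] = m.group(1).strip()
    return out

# Toy response in the assumed format.
response = (
    "<SUMMARY>Count the red objects.</SUMMARY>"
    "<CAPTION>The image shows two red cubes and one blue ball.</CAPTION>"
    "<REASONING>Only the cubes are red, so the count is 2.</REASONING>"
    "<CONCLUSION>2</CONCLUSION>"
)
stages = parse_stages(response)
```

Keeping the stages machine-separable like this is what makes stage-level test-time search possible: a scorer can judge and regenerate one stage without discarding the others.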
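The stage-wise retracing idea behind SWIRES can be sketched as a search loop: sample several candidates for the current stage, accept the best one if it clears a score threshold, and otherwise retrace to the previous stage and retry. This is a hedged toy reconstruction of the general idea from the abstract only; the `generate` and `score` callables, the threshold, and the retrace budget are all illustrative stand-ins, not the paper's actual components.

```python
# Toy stage-wise retracing search: accept the best-scoring candidate per
# stage, or drop the previous stage and retry when no candidate is good
# enough. `generate(stage, prefix)` and `score(stage, prefix, cand)` are
# hypothetical stand-ins for a VLM sampler and a stage scorer.
def stagewise_retracing_search(generate, score, n_stages,
                               n_candidates=3, threshold=0.5, max_retraces=4):
    prefix = []      # accepted stage outputs so far
    retraces = 0
    stage = 0
    while stage < n_stages:
        candidates = [generate(stage, prefix) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: score(stage, prefix, c))
        if score(stage, prefix, best) >= threshold or retraces >= max_retraces:
            prefix.append(best)   # accept and move to the next stage
            stage += 1
        else:
            if prefix:            # retrace: discard the previous stage
                prefix.pop()
                stage -= 1
            retraces += 1
    return prefix

# With a trivial generator and a scorer that always accepts,
# the search walks straight through all four stages.
result = stagewise_retracing_search(
    lambda s, p: f"s{s}", lambda s, p, c: 1.0, n_stages=4)
```

The retrace budget (`max_retraces`) bounds worst-case compute, which is one way such a search could trade accuracy against test-time cost.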
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
We present JanusFlow, a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding is that rectified flow can be trained straightforwardly within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.
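The rectified flow objective mentioned above is simple enough to state concretely: a velocity network is regressed onto the straight-line displacement between a noise sample and a data sample along a linear interpolation path. The NumPy sketch below shows the standard rectified flow loss in its generic form, not JanusFlow's specific implementation; the toy shapes and the oracle velocity function are assumptions for illustration.

```python
# Generic rectified flow training loss: interpolate x_t = (1-t)*x0 + t*x1
# between noise x0 and data x1, and regress the predicted velocity onto
# the straight-line target x1 - x0.
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(velocity_fn, x0, x1, t):
    """MSE between predicted velocity and the straight-line target."""
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1  # linear interpolation
    target = x1 - x0                                  # constant velocity
    pred = velocity_fn(x_t, t)
    return np.mean((pred - target) ** 2)

# Toy sanity check: the exact target velocity achieves zero loss.
x0 = rng.normal(size=(4, 8))   # noise samples
x1 = rng.normal(size=(4, 8))   # data samples
t = rng.uniform(size=4)        # interpolation times in [0, 1]

oracle = lambda x_t, t: x1 - x0
loss = rectified_flow_loss(oracle, x0, x1, t)  # exactly 0.0 for the oracle
```

Because the target velocity is a simple regression target, the loss drops into a standard language-model training loop without special machinery, which is consistent with the abstract's claim that no complex architectural modifications are needed.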