A Very Big Video Reasoning Suite

https://huggingface.co/papers/2602.20159

https://arxiv.org/abs/2602.20159


Rapid progress in video models has largely focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning grounds intelligence in spatiotemporally consistent visual environments that go beyond what text can naturally capture, enabling intuitive reasoning over spatiotemporal structure, such as continuity, interaction, and causality. However, systematically studying video reasoning and its scaling behavior is hindered by the lack of large-scale video reasoning training data. To address this gap, we introduce the Very Big Video Reasoning (VBVR) Dataset, an unprecedentedly large-scale resource spanning 200 curated reasoning tasks following a principled taxonomy, and over one million video clips, approximately three orders of magnitude larger than existing datasets. We further present VBVR-Bench, a verifiable evaluation framework that moves beyond model-based judging by incorporating rule-based, human-aligned scorers, enabling reproducible and interpretable diagnosis of video reasoning capabilities. Leveraging the VBVR suite, we conduct one of the first large-scale scaling studies of video reasoning and observe early signs of emergent generalization to unseen reasoning tasks. Together, VBVR lays a foundation for the next stage of research in generalizable video reasoning. The data, benchmark toolkit, and models are released publicly at video-reason.com.


Does Your Reasoning Model Implicitly Know When to Stop Thinking?

https://huggingface.co/papers/2602.08354

https://arxiv.org/abs/2602.08354

Beihang University, ByteDance

Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.


OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

https://huggingface.co/papers/2602.05400

https://arxiv.org/abs/2602.05400

Alibaba

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
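
As a rough illustration of the scoring idea in the abstract above, the following Python sketch projects an optimizer-shaped update onto a target direction and then draws candidates by Boltzmann sampling. It is a minimal sketch under stated assumptions, not the authors' implementation: the Adam-style scaling, the function names (`shaped_update`, `projected_utility`, `boltzmann_sample`), and the temperature parameter are all hypothetical, and the Ghost technique and CountSketch compression mentioned in the abstract are left out.

```python
import math
import random


def shaped_update(grad, second_moment, eps=1e-8):
    """Scale a raw gradient the way an Adam-like optimizer would.

    Assumption: a simple g / (sqrt(v) + eps) rule stands in for whatever
    update shaping the real optimizer applies.
    """
    return [g / (math.sqrt(v) + eps) for g, v in zip(grad, second_moment)]


def projected_utility(grad, second_moment, target_direction):
    """Score a candidate as the dot product of its shaped update and the target."""
    update = shaped_update(grad, second_moment)
    return sum(u * t for u, t in zip(update, target_direction))


def boltzmann_sample(scores, k, temperature=1.0, rng=None):
    """Draw k distinct indices with probability proportional to exp(score / T)."""
    rng = rng or random.Random(0)
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    remaining = list(range(len(scores)))
    chosen = []
    for _ in range(min(k, len(remaining))):
        threshold = rng.random() * sum(weights[i] for i in remaining)
        running = 0.0
        pick = remaining[-1]  # fallback guards against float round-off
        for i in remaining:
            running += weights[i]
            if threshold <= running:
                pick = i
                break
        chosen.append(pick)
        remaining.remove(pick)
    return chosen
```

A lower temperature concentrates the draw on the highest-scoring candidates, while a higher one spreads it across the pool. This is one plausible reading of how sampling, rather than a hard top-k cut, could preserve data diversity.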


Weak-Driven Learning: How Weak Agents make Strong Agents Stronger

https://huggingface.co/papers/2602.08222

https://arxiv.org/abs/2602.08222

https://github.com/chenzehao82/Weak-Driven-Learning

Beihang University, China Telecom

As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once models grow highly confident, further training yields diminishing returns. While existing methods continue to reinforce target predictions, we find that informative supervision signals remain latent in models' own historical weak states. Motivated by this observation, we propose WMSS (Weak Agents Can Make Strong Agents Stronger), a post-training paradigm that leverages weak checkpoints to guide continued optimization. By identifying recoverable learning gaps via entropy dynamics and reinforcing them through compensatory learning, WMSS enables strong agents to improve beyond conventional post-training saturation. Experiments on mathematical reasoning and code generation datasets show that agents trained with our approach achieve effective performance improvements, while incurring zero additional inference cost.


Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

https://huggingface.co/papers/2602.00919

https://arxiv.org/abs/2602.00919

https://github.com/greenvla/GreenVLA


We introduce Green-VLA, a staged Vision–Language–Action framework for real-world deployment on the humanoid Green robot, while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) RL-based policy alignment. Progression builds semantic and physical priors, learns shared affordances, and aligns policies for long-horizon execution beyond behavior cloning. At its core is a unified data and control stack for robot fleets. A scalable data-processing pipeline, including DataQA and temporal-alignment filters, synchronizes 3,000 hours of demonstrations; a unified, embodiment-aware action interface enables a single policy to control humanoids, mobile manipulators, and fixed-base arms; and the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and a joint-prediction-based guidance module that generalizes to unseen objects. Optimized for the Green humanoid, Green-VLA generalizes in a zero-shot manner to new embodiments and achieves state-of-the-art performance across bimanual systems and benchmarks, with RL alignment providing gains in success rate, robustness, and long-horizon efficiency.


ERNIE 5.0 Technical Report

https://huggingface.co/papers/2602.04705

https://arxiv.org/abs/2602.04705

Baidu

In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.
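
To make the idea of modality-agnostic expert routing more concrete, here is a generic top-k mixture-of-experts router in plain Python. It is only a sketch of the standard technique, not ERNIE 5.0's actual router: the function names (`softmax`, `route_token`), the `top_k` default, and the renormalization of the selected gates are assumptions made for illustration. What it does show is that the router sees only a token's hidden vector, so text, image, video, and audio tokens would all pass through the same selection logic.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(hidden, router_weights, top_k=2):
    """Pick the top_k experts for one token and return (expert, gate) pairs.

    hidden:         the token's hidden vector (no modality tag is used)
    router_weights: one weight row per expert (hypothetical toy router)
    """
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    selected = ranked[:top_k]
    norm = sum(probs[i] for i in selected)
    # Renormalize so the gates of the selected experts sum to 1 (an assumption).
    return [(i, probs[i] / norm) for i in selected]
```

With many experts and a small `top_k`, only a small fraction of parameters is active per token, which is the sense in which such architectures are called sparse.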


Kimi K2.5: Visual Agentic Intelligence

https://huggingface.co/papers/2602.02276

https://arxiv.org/abs/2602.02276

https://www.kimi.com/blog/kimi-k2-5

https://github.com/MoonshotAI/Kimi-K2.5

Moonshot AI

We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that the two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to 4.5× over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
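
The orchestration pattern named in the abstract, decomposing a task and running the pieces concurrently, can be sketched with Python's standard `asyncio` module. This is an illustrative sketch only and not the Agent Swarm API: `decompose`, `run_subagent`, and `run_swarm` are hypothetical names, the decomposition is hard-coded rather than model-driven, and `asyncio.sleep` stands in for real model or tool calls.

```python
import asyncio


def decompose(task):
    """Split a task into (agent_name, payload, simulated_seconds) sub-problems.

    Assumption: a fixed three-way split; a real system would decide this dynamically.
    """
    return [("search", task, 0.02), ("code", task, 0.03), ("review", task, 0.01)]


async def run_subagent(name, payload, seconds):
    """Simulate one sub-agent working on its sub-problem."""
    await asyncio.sleep(seconds)  # placeholder for a model or tool call
    return f"{name}:{payload}"


async def run_swarm(task):
    """Run all sub-agents concurrently and collect results in submission order."""
    jobs = [run_subagent(name, payload, secs) for name, payload, secs in decompose(task)]
    return list(await asyncio.gather(*jobs))


if __name__ == "__main__":
    print(asyncio.run(run_swarm("report")))
```

Because the simulated sub-agents wait concurrently, the wall-clock time of this toy run is bounded by the slowest sub-agent rather than by the sum of all three. That is the general mechanism by which parallel orchestration can cut latency relative to a single sequential agent.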


PaperBanana: Automating Academic Illustration for AI Scientists

https://huggingface.co/papers/2601.23265

https://arxiv.org/abs/2601.23265

https://dwzhu-pku.github.io/PaperBanana/

https://github.com/dwzhu-pku/PaperBanana

Google

Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
