AI Can Learn Scientific Taste
OpenMOSS
AI 可以学会科研品味
Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year test sets, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
伟大的科学家往往具备很强的判断力与前瞻性,这与我们所说的“科研品味”密切相关。这里我们用它指代判断并提出具有高潜在影响力研究想法的能力。然而,现有相关工作大多聚焦于提升 AI 科学家的执行能力,而对 AI 的科研品味如何提升研究不足。本文提出社区反馈强化学习(RLCF),利用大规模社区信号作为监督,并将科研品味学习建模为偏好建模与对齐问题。在偏好建模阶段,我们基于 70 万组领域与时间匹配的高引用/低引用论文对训练 Scientific Judge 来判断想法优劣;在偏好对齐阶段,则以 Scientific Judge 为奖励模型训练策略模型 Scientific Thinker,生成更具潜在影响力的研究想法。实验表明,Scientific Judge 超过了 GPT-5.2、Gemini 3 Pro 等最先进大模型,并能泛化到未来年份测试、未见领域及同行评审偏好上。进一步地,Scientific Thinker 提出的研究想法比基线具有更高潜在影响力。结果说明,AI 可以习得科研品味,这也是迈向人类级 AI 科学家的关键一步。
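To make "field- and time-matched pairs of high- vs. low-citation papers" concrete, here is a minimal sketch of how such preference pairs could be assembled. It is an illustration only: the record layout, the `min_ratio` threshold, and the function name `build_matched_pairs` are assumptions, not details taken from the paper.

```python
from collections import defaultdict
from itertools import product


def build_matched_pairs(papers, min_ratio=2.0):
    """Pair high- vs. low-citation papers that share a field and a year.

    `papers` is a list of dicts with keys: id, field, year, citations.
    `min_ratio` (hypothetical) is how many times more citations the
    preferred paper needs before the pair counts as a clear preference.
    """
    # Matching step: only papers in the same (field, year) bucket are compared.
    buckets = defaultdict(list)
    for p in papers:
        buckets[(p["field"], p["year"])].append(p)

    pairs = []
    for group in buckets.values():
        for a, b in product(group, repeat=2):
            if a["id"] == b["id"]:
                continue
            # Keep (preferred, rejected) only when the citation gap is unambiguous.
            if a["citations"] >= min_ratio * max(b["citations"], 1):
                pairs.append((a["id"], b["id"]))
    return pairs


papers = [
    {"id": "A", "field": "cs.CL", "year": 2022, "citations": 400},
    {"id": "B", "field": "cs.CL", "year": 2022, "citations": 12},
    {"id": "C", "field": "cs.CV", "year": 2022, "citations": 900},
    {"id": "D", "field": "cs.CL", "year": 2023, "citations": 5},
]
pairs = build_matched_pairs(papers)
print(pairs)
```

Only A and B share both field and year, so they form the single pair; C and D have no matched partner. Matching on field and year is what keeps the citation gap from simply reflecting a paper's age or the size of its research area.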
Demystifying Video Reasoning
SenseNova
https://github.com/OpenSenseNova/Demystifying_Video_Reasoning
视频推理揭秘
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
近期视频生成的进展揭示了一个意外现象:基于扩散的视频模型展现出非平凡的推理能力。以往工作将其归因于“帧链”(CoF)机制,认为推理沿着视频帧顺序展开。本文挑战这一假设,并揭示了一种根本不同的机制。我们发现,视频模型中的推理主要沿着扩散去噪步骤涌现,而不是沿帧展开。通过定性分析和针对性探针实验,我们发现模型在早期去噪步骤中会探索多个候选解,并逐步收敛到最终答案,我们将这一过程称为“步骤链”(CoS)。除这一核心机制外,我们还识别出若干对性能至关重要的涌现行为:1)工作记忆,使模型能够持续保持参照;2)自我纠错与增强,使模型能从错误的中间解中恢复;3)先感知后行动,即早期步骤建立语义锚定,后期步骤完成结构化操作。进一步地,在单个扩散步骤内部,我们发现扩散 Transformer 出现了自演化的功能分工:前层编码致密感知结构,中层执行推理,后层整合潜表示。基于这些发现,我们提出一个简单的免训练策略作为概念验证:利用同一模型不同随机种子产生的潜轨迹进行集成,可以提升推理能力。整体而言,本文系统解释了视频生成模型中推理能力的涌现机制,为未来更好利用视频模型内在推理动态这一新型智能载体奠定了基础。
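The training-free idea of ensembling latent trajectories across random seeds can be sketched with a toy denoiser. Everything here is an assumption made for illustration: the `denoise_step` stand-in, the choice to average latents, and the `merge_step` parameter are hypothetical, and the abstract does not say how or when the paper merges trajectories.

```python
import random


def denoise_step(latent, step, total_steps, target):
    """Toy stand-in for one denoising step: move the latent toward `target`."""
    rate = 1.0 / (total_steps - step)
    return [x + rate * (t - x) for x, t in zip(latent, target)]


def run_ensemble(seeds, total_steps, merge_step, target, dim=4):
    """Run one trajectory per seed, average the latents at `merge_step`, then finish."""
    latents = []
    for seed in seeds:
        rng = random.Random(seed)  # each seed gives a different initial noise
        latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for step in range(merge_step):
            latent = denoise_step(latent, step, total_steps, target)
        latents.append(latent)

    # Ensemble: element-wise mean over the per-seed latents (assumed merge rule).
    merged = [sum(vals) / len(vals) for vals in zip(*latents)]

    # Continue denoising from the merged latent as a single trajectory.
    for step in range(merge_step, total_steps):
        merged = denoise_step(merged, step, total_steps, target)
    return merged


target = [1.0, -1.0, 0.5, 0.0]
final = run_ensemble(seeds=[0, 1, 2], total_steps=10, merge_step=3, target=target)
```

The sketch only shows the control flow: several seeds explore in the early steps, where the abstract reports that candidate solutions are still being considered, and a single merged latent is refined afterwards. It does not reproduce any reasoning gain.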
InCoder-32B: Code Foundation Model for Industrial Scenarios
https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder
Beihang University
InCoder-32B:面向工业场景的代码基础模型
Recent code large language models have achieved remarkable progress on general programming tasks. Nevertheless, their performance degrades significantly in industrial scenarios that require reasoning about hardware semantics, specialized language constructs, and strict resource constraints. To address these challenges, we introduce InCoder-32B (Industrial-Coder-32B), the first 32B-parameter code foundation model unifying code intelligence across chip design, GPU kernel optimization, embedded systems, compiler optimization, and 3D modeling. By adopting an efficient architecture, we train InCoder-32B from scratch with general code pre-training, curated industrial code annealing, mid-training that progressively extends context from 8K to 128K tokens with synthetic industrial reasoning data, and post-training with execution-grounded verification. We conduct extensive evaluation on 14 mainstream general code benchmarks and 9 industrial benchmarks spanning 4 specialized domains. Results show InCoder-32B achieves highly competitive performance on general tasks while establishing strong open-source baselines across industrial domains.
近年来代码大模型在通用编程任务上取得了显著进展,但在需要硬件语义理解、专用语言结构推理以及严格资源约束的工业场景中,性能仍会明显下降。为应对这些挑战,我们提出 InCoder-32B(Industrial-Coder-32B),这是首个统一芯片设计、GPU Kernel 优化、嵌入式系统、编译器优化和 3D 建模等工业代码智能的 320 亿参数代码基础模型。通过采用高效架构,我们从零开始训练 InCoder-32B,并依次进行通用代码预训练、精选工业代码退火、中期训练(使用合成工业推理数据将上下文从 8K 逐步扩展到 128K),以及基于执行验证的后训练。我们在 14 个主流通用代码基准和覆盖 4 个专业领域的 9 个工业基准上进行了全面评测。结果表明,InCoder-32B 在通用任务上保持了很强竞争力,同时在工业领域建立了强有力的开源基线。
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
SocialOmni:评测全模态模型中的视听社交交互能力
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs and uncover significant variance in social-interaction capabilities across models. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
全模态大语言模型通过原生整合音频、视觉和文本,正在重新定义人机交互。然而,现有 OLM 基准仍然停留在静态、以准确率为中心的任务上,难以评估社交交互性这一核心能力,即模型在自然对话中理解并处理动态线索的能力。为此,我们提出 SocialOmni,一个从三个核心维度对这种会话交互性进行系统评测的综合基准:1)说话人分离与识别,即“谁在说话”;2)打断时机控制,即“什么时候插话”;3)自然打断生成,即“如何说出插话内容”。SocialOmni 包含 2000 个感知样本,以及一个经过质量控制的诊断集,覆盖 209 个带有严格时间与上下文约束的交互生成实例,并额外设计了受控的音视频不一致场景来测试模型鲁棒性。我们对 12 个领先的全模态模型进行了基准测试,结果揭示了不同模型在社交交互能力上的显著差异。进一步分析表明,模型的感知准确率与其生成语境恰当打断语句的能力之间存在明显脱钩,这说明仅靠理解类指标无法刻画会话中的社会能力。更令人鼓舞的是,SocialOmni 提供的诊断信号为未来弥合感知与交互之间的鸿沟提供了可操作的方向。
Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
rednote-hilab
利用群体级自然语言反馈引导强化学习探索
Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. This group-level feedback is aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, with a 2.2× improvement in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
大语言模型在与环境交互时通常会接收到丰富的自然语言反馈,但当前强化学习算法几乎只依赖标量奖励,导致这些自然语言中蕴含的丰富信息没有被充分利用,也使探索效率受限。本文提出 GOLF,一个显式利用群体级语言反馈来引导定向探索的强化学习框架。GOLF 汇聚两类互补反馈源:1)外部批评,指出错误或给出针对性修正;2)组内尝试,提供替代性的部分思路和多样化失败模式。系统将这些群体级反馈聚合为高质量 refinements,并将其自适应注入训练过程,作为 off-policy 脚手架,在稀疏奖励区域提供更有针对性的指导。同时,GOLF 在统一的 RL 闭环中联合优化生成与修正能力,形成相互促进的正循环。我们在可验证与不可验证基准上的实验表明,GOLF 在性能和探索效率上均优于现有方法,相比只使用标量奖励训练的 RL 方法,样本效率提升达到 2.2 倍。
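The aggregation of group-level feedback described above can be sketched as follows. This is a hedged illustration, not the GOLF implementation: the prompt template, the function names, and the rule "inject a scaffold only when every attempt in the group failed" are all assumptions standing in for the adaptive injection the abstract mentions.

```python
def build_refinement_prompt(question, attempts, critiques):
    """Combine intra-group attempts and external critiques into one refinement prompt."""
    parts = [f"Question: {question}", "Previous attempts and feedback:"]
    for i, (attempt, critique) in enumerate(zip(attempts, critiques), start=1):
        parts.append(f"[{i}] Attempt: {attempt}")
        parts.append(f"[{i}] Critique: {critique}")
    parts.append("Write an improved answer that fixes the issues above.")
    return "\n".join(parts)


def inject_scaffold(group_rewards, refinement, threshold=0.0):
    """Return the refinement as an extra off-policy sample only for all-fail groups.

    `threshold` is a hypothetical cutoff: a group counts as sparse-reward
    when no attempt scores above it.
    """
    return [refinement] if max(group_rewards) <= threshold else []


prompt = build_refinement_prompt(
    "What is 17 * 23?",
    attempts=["17 * 23 = 381", "17 * 23 = 401"],
    critiques=["The carry in the tens place is wrong.", "Off by ten; recheck 17 * 20."],
)
extra = inject_scaffold([0.0, 0.0], refinement="17 * 23 = 391")
```

Here the group received no reward, so the refinement is added to the batch; a group with at least one successful attempt would return an empty list and train on its own samples.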
Heterogeneous Agent Collaborative Reinforcement Learning
ByteDance
异构智能体协同强化学习
We introduce Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new learning paradigm that addresses the inefficiencies of isolated on-policy optimization. HACRL enables collaborative optimization with independent execution: heterogeneous agents share verified rollouts during training to mutually improve, while operating independently at inference time. Unlike LLM-based multi-agent reinforcement learning (MARL), HACRL does not require coordinated deployment, and unlike on-/off-policy distillation, it enables bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer. Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate capability discrepancies and policy distribution shifts, HACPO introduces four tailored mechanisms with theoretical guarantees on unbiased advantage estimation and optimization correctness. Extensive experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
我们提出异构智能体协同强化学习(HACRL),这是一种旨在解决孤立式 on-policy 优化低效率问题的新范式。HACRL 实现了“训练协同、推理独立”:异构智能体在训练时共享经验证的 rollout 以相互提升,而在推理阶段仍各自独立运行。不同于基于 LLM 的多智能体强化学习,HACRL 不要求部署时协同;也不同于 on/off-policy 蒸馏,它支持异构智能体之间双向互学,而非单向的教师到学生迁移。基于这一范式,我们进一步提出 HACPO,一种能够进行原则化 rollout 共享的协同 RL 算法,以最大化样本利用率和跨智能体知识迁移。为缓解能力差异与策略分布偏移,HACPO 设计了四种具有理论保证的机制,从而确保优势估计无偏和优化过程正确。跨多种异构模型组合与推理基准的实验表明,HACPO 能稳定提升所有参与智能体的表现,在仅使用一半 rollout 成本的情况下,平均比 GSPO 高 3.3%。
Helios: Real Real-Time Long Video Generation Model
ByteDance
Helios:真正实时的长视频生成模型
We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to -- or lower than -- those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
我们提出 Helios,这是首个可在单张 NVIDIA H100 GPU 上以 19.5 FPS 运行、支持分钟级生成,同时质量可与强基线匹敌的 140 亿参数视频生成模型。它在三个关键维度上实现突破:1)无需 self-forcing、error bank、关键帧采样等常见防漂移启发式策略,也能抵抗长视频漂移;2)无需 KV-cache、稀疏/线性注意力或量化等标准加速技术,也能实现实时生成;3)训练时无需并行或分片框架,却仍能达到图像扩散级 batch size,并在 80GB 显存内容纳最多四个 14B 模型。具体而言,Helios 是一个统一输入表示的 14B 自回归扩散模型,原生支持 T2V、I2V 和 V2V 任务。为缓解长视频生成漂移,我们刻画了典型失败模式,并提出在训练中显式模拟漂移的简单有效策略,同时从源头消除重复运动。为提升效率,我们对历史上下文与噪声上下文进行重度压缩,并减少采样步数,使计算成本可与甚至低于 13 亿参数视频生成模型。除此之外,我们还引入了基础设施级优化,同时加速训练与推理并降低显存开销。大量实验表明,Helios 在短视频和长视频生成上都持续优于既有方法。
Utonia: Toward One Encoder for All Point Clouds
Pointcept
Utonia:迈向统一处理所有点云的单一编码器
We dream of a future where point clouds from all domains can come together to shape a single model that benefits them all. Toward this goal, we present Utonia, a first step toward training a single self-supervised point transformer encoder across diverse domains, spanning remote sensing, outdoor LiDAR, indoor RGB-D sequences, object-centric CAD models, and point clouds lifted from RGB-only videos. Despite their distinct sensing geometries, densities, and priors, Utonia learns a consistent representation space that transfers across domains. This unification improves perception capability while revealing intriguing emergent behaviors that arise only when domains are trained jointly. Beyond perception, we observe that Utonia representations can also benefit embodied and multimodal reasoning: conditioning vision-language-action policies on Utonia features improves robotic manipulation, and integrating them into vision-language models yields gains on spatial reasoning. We hope Utonia can serve as a step toward foundation models for sparse 3D data, and support downstream applications in AR/VR, robotics, and autonomous driving.
我们希望未来来自不同领域的点云能够汇聚成一个共同受益的统一模型。为此,我们提出 Utonia,迈出了在多领域数据上训练单一自监督 point transformer 编码器的第一步,覆盖遥感、室外 LiDAR、室内 RGB-D 序列、以物体为中心的 CAD 模型,以及从 RGB 视频提升而来的点云。尽管这些数据在传感几何、密度和先验上差异很大,Utonia 仍学习到一个跨域可迁移的一致表示空间。这种统一不仅提升了感知能力,也揭示了只有在多域联合训练时才会出现的一些有趣涌现行为。除感知外,我们还观察到 Utonia 表征也能帮助具身与多模态推理:用 Utonia 特征条件化视觉-语言-动作策略可以提升机器人操作效果,将其整合进视觉语言模型也能提升空间推理表现。我们希望 Utonia 能成为稀疏 3D 数据基础模型的一步,为 AR/VR、机器人和自动驾驶等下游应用提供支持。
MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification
MiroMind AI
https://www.miromind.ai/blog/mirothinker-1.7-h1-towards-heavy-duty-research-agents-via-verification
MiroThinker-1.7 与 H1:通过验证机制迈向重型研究智能体
We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.
我们提出 MiroThinker-1.7,这是一种面向复杂长程推理任务的新型研究智能体。在此基础上,我们进一步提出 MiroThinker-H1,为该智能体引入更强的重型推理能力,从而实现更可靠的多步问题求解。具体来说,MiroThinker-1.7 通过一个强调结构化规划、上下文推理与工具交互的智能体式中期训练阶段,提高了每一步交互的可靠性,使其更擅长处理复杂任务中的多步交互和持续推理。MiroThinker-H1 则进一步将验证机制直接融入推理过程的局部与全局层面:中间推理决策可以在推理时被评估和修正,而整体推理轨迹也会被审计,以确保最终答案由连贯的证据链支撑。在覆盖开放网络研究、科学推理和金融分析的多项基准上,MiroThinker-H1 在深度研究任务上取得了最先进性能,同时在专业领域基准上也保持了强劲结果。我们还开源了 MiroThinker-1.7 和 MiroThinker-1.7-mini,在显著提升效率的同时提供了具有竞争力的研究智能体能力。
Attention Residuals
Moonshot AI
注意力残差
Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
带有 PreNorm 的残差连接是现代 LLM 的标准配置,但它会以固定的单位权重累计所有层输出。这种统一聚合会导致隐藏状态随着深度不断膨胀,并逐步稀释每一层各自的贡献。我们提出 Attention Residuals(AttnRes),用对前面各层输出做 softmax 注意力的方式替代固定累加,使每一层都能以学习到的、依赖输入的权重选择性聚合更早的表示。为了缓解在大模型训练中对全部前层输出做注意力带来的内存与通信开销,我们进一步提出 Block AttnRes,将层划分为块并在块级表示上做注意力,在保留大部分增益的同时减少内存占用。结合基于缓存的流水线通信和两阶段计算策略后,Block AttnRes 成为几乎无需额外开销即可替换标准残差连接的实用方案。Scaling law 实验表明,该改进在不同模型规模上均成立,消融实验也验证了按内容进行深度选择的价值。我们还将 AttnRes 集成到 Kimi Linear 架构(总参数 48B、激活参数 3B)中,并在 1.4T token 上进行预训练,结果显示它缓解了 PreNorm 稀释问题,使不同深度上的输出幅值和梯度分布更加均匀,并在所有评测任务上带来下游性能提升。
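The contrast between fixed unit-weight accumulation and softmax attention over preceding layer outputs can be shown in a few lines. This is a simplified sketch under stated assumptions: it uses the raw layer outputs as keys and takes the query as an argument, whereas the abstract does not specify how AttnRes derives queries and keys, and the Block AttnRes variant is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def standard_residual(layer_outputs):
    """Standard accumulation: every earlier output is added with weight 1."""
    return [sum(vals) for vals in zip(*layer_outputs)]


def attention_residual(layer_outputs, query):
    """Aggregate earlier outputs with softmax weights instead of unit weights."""
    d = len(query)
    # Scaled dot-product score between the query and each earlier output.
    scores = [sum(q * k for q, k in zip(query, out)) / math.sqrt(d)
              for out in layer_outputs]
    weights = softmax(scores)
    mixed = [sum(w * out[i] for w, out in zip(weights, layer_outputs))
             for i in range(d)]
    return mixed, weights


outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # outputs of three earlier layers
summed = standard_residual(outputs)
mixed, weights = attention_residual(outputs, query=[2.0, 0.0])
```

Because the softmax weights sum to one, `mixed` is a convex combination of the earlier outputs and cannot grow with the number of layers, while `summed` grows with every layer added. That is the hidden-state growth the abstract attributes to uniform aggregation.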
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
https://kj-chen666.github.io/Hybrid-Memory-in-Video-World-Models/
H-EmbodVis
眼不见亦未忘:面向动态视频世界模型的混合记忆
Video world models have shown immense potential in simulating the physical world, yet existing memory mechanisms primarily treat environments as static canvases. When dynamic subjects hide out of sight and later re-emerge, current methods often struggle, leading to frozen, distorted, or vanishing subjects. To address this, we introduce Hybrid Memory, a novel paradigm requiring models to simultaneously act as precise archivists for static backgrounds and vigilant trackers for dynamic subjects, ensuring motion continuity during out-of-view intervals. To facilitate research in this direction, we construct HM-World, the first large-scale video dataset dedicated to hybrid memory. It features 59K high-fidelity clips with decoupled camera and subject trajectories, encompassing 17 diverse scenes, 49 distinct subjects, and meticulously designed exit-entry events to rigorously evaluate hybrid coherence. Furthermore, we propose HyDRA, a specialized memory architecture that compresses memory into tokens and utilizes a spatiotemporal relevance-driven retrieval mechanism. By selectively attending to relevant motion cues, HyDRA effectively preserves the identity and motion of hidden subjects. Extensive experiments on HM-World demonstrate that our method significantly outperforms state-of-the-art approaches in both dynamic subject consistency and overall generation quality.
视频世界模型在模拟物理世界方面展现出巨大潜力,但现有记忆机制大多默认环境是静态画布。当动态主体离开视野后再重新出现时,当前方法常常会出现主体冻结、扭曲或消失等问题。为此,我们提出 Hybrid Memory,这一新范式要求模型同时扮演两个角色:对静态背景进行精确归档的档案管理员,以及对动态主体持续跟踪的警觉观察者,从而在目标离开视野期间仍保持运动连续性。为推动这一方向研究,我们构建了首个面向混合记忆的大规模视频数据集 HM-World,其中包含 5.9 万段高保真视频片段,解耦了相机与主体轨迹,覆盖 17 个场景、49 类主体,并精心设计了出视野-再入视野事件以严格评估混合一致性。进一步地,我们提出专门的记忆架构 HyDRA,将记忆压缩为 token,并利用时空相关性驱动的检索机制。通过选择性关注相关运动线索,HyDRA 能有效保留隐藏主体的身份和运动信息。在 HM-World 上的大量实验表明,我们的方法在动态主体一致性和整体生成质量方面都显著优于现有最先进方法。
ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling
ShotStream:面向交互式叙事的流式多镜头视频生成
Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches, eliminating ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling.
多镜头视频生成对于长篇叙事至关重要,但当前双向架构往往交互性不足且延迟较高。我们提出 ShotStream,一种新的因果式多镜头架构,能够实现交互式叙事以及高效的在线逐帧生成。其核心做法是将任务重构为“基于历史上下文的下一镜头生成”,从而允许用户通过流式提示动态干预正在进行的叙事。我们首先将一个文本到视频模型微调为双向的下一镜头生成器,然后通过 Distribution Matching Distillation 将其蒸馏为因果式学生模型。为解决自回归生成中固有的镜头间一致性与误差累积问题,我们提出两项关键创新:其一是双缓存记忆机制,全局缓存保留条件帧以维护镜头间一致性,局部缓存保存当前镜头内已生成帧以保持镜头内一致性,并使用 RoPE 不连续指示器显式区分两种缓存;其二是两阶段蒸馏策略,先在真实历史镜头条件下进行镜头内 self-forcing,再逐步扩展到基于自生成历史的镜头间 self-forcing,以缩小训练与测试差距。实验表明,ShotStream 能以低于 1 秒的延迟生成连贯的多镜头视频,在单 GPU 上达到 16 FPS,质量可与更慢的双向模型持平甚至超越,为实时交互式叙事打开了道路。
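The dual-cache memory can be pictured as two bounded buffers with different lifetimes. The sketch below is a toy under stated assumptions: the class name `DualCache`, the cache sizes, and the rule for promoting frames at a shot boundary are hypothetical, and the string tags merely stand in for the RoPE discontinuity indicator, which in the paper operates on positional encodings rather than labels.

```python
from collections import deque


class DualCache:
    """Toy dual-cache memory: a global cache across shots, a local cache within a shot."""

    def __init__(self, global_size=4, local_size=8):
        self.global_cache = deque(maxlen=global_size)  # conditional frames from past shots
        self.local_cache = deque(maxlen=local_size)    # frames generated in the current shot

    def add_frame(self, frame):
        """Record a newly generated frame of the current shot."""
        self.local_cache.append(frame)

    def end_shot(self, keep=1):
        """Promote the last `keep` frames to the global cache, then reset the local one."""
        for frame in list(self.local_cache)[-keep:]:
            self.global_cache.append(frame)
        self.local_cache.clear()

    def context(self):
        """Return the conditioning context, tagging which cache each frame came from."""
        return ([("global", f) for f in self.global_cache]
                + [("local", f) for f in self.local_cache])


cache = DualCache()
cache.add_frame("shot1_frame1")
cache.add_frame("shot1_frame2")
cache.end_shot()
cache.add_frame("shot2_frame1")
ctx = cache.context()
```

After the shot boundary, the context holds one global frame from shot 1 for inter-shot consistency and one local frame from shot 2 for intra-shot consistency, each carrying a tag so the two sources stay distinguishable.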
dLLM: Simple Diffusion Language Modeling
University of California, Berkeley
dLLM:简洁的扩散语言建模框架
Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.
尽管扩散语言模型(DLM)发展迅速,许多新方法实际上逐渐收敛到了相似的一组核心组件。然而这些组件往往分散在临时性的研究代码库中,或者缺乏透明实现,导致复现和扩展都较为困难。随着该领域加速发展,亟需一个既能标准化这些共性组件、又能灵活支持新方法与新架构的统一框架。为此,我们提出 dLLM,一个开源框架,将扩散语言建模的核心流程统一起来,包括训练、推理与评测,并使其易于针对新设计进行定制。借助 dLLM,用户可以通过标准化管线复现、微调、部署和评估 LLaDA、Dream 等开源大型 DLM。该框架还提供了使用可及计算资源从零构建小型 DLM 的最小可复现 recipe,包括将任意 BERT 风格编码器或自回归语言模型转换为 DLM。我们同时发布了这些小型 DLM 的检查点,以降低 DLM 研究门槛并加速后续工作。
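One of the shared components such a framework standardizes is the forward corruption process. The sketch below assumes masked (absorbing-state) diffusion, the formulation used by LLaDA-style models; it is not code from dLLM, and the names `MASK`, `forward_mask`, and `masked_positions` are made up for illustration.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (hypothetical)


def forward_mask(tokens, t, rng):
    """Forward process: independently replace each token with MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]


def masked_positions(noisy):
    """Positions the model would be trained to reconstruct."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]


rng = random.Random(0)
tokens = "diffusion language models denoise masked tokens".split()
noisy = forward_mask(tokens, t=0.5, rng=rng)
targets = masked_positions(noisy)
```

At `t=0.0` the sequence is untouched and at `t=1.0` every token is masked; training samples `t` in between and asks the model to predict the original tokens at the positions in `targets`.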
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence
Baidu
Qianfan-OCR:统一的端到端文档智能模型
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
我们提出 Qianfan-OCR,一个 40 亿参数的端到端视觉语言模型,在单一架构中统一了文档解析、版面分析与文档理解。它可以直接将图像转换为 Markdown,并支持表格提取、图表理解、文档问答和关键信息抽取等多种由 prompt 驱动的任务。为解决端到端 OCR 中显式版面分析能力丢失的问题,我们提出 Layout-as-Thought:通过特殊 think token 触发一个可选的“思考阶段”,在最终输出前生成结构化的版面表示,包括边界框、元素类型和阅读顺序,从而在提升复杂版面准确率的同时恢复版面 grounding 能力。Qianfan-OCR 在 OmniDocBench v1.5 和 OlmOCR Bench 上分别取得 93.12 和 79.8 的端到端模型第一名,在 OCRBench、CCOCR、DocVQA 和 ChartQA 上也对同等规模通用 VLM 展现出有竞争力的结果,并在公开的关键信息抽取基准上获得最高平均分,超过 Gemini-3.1-Pro、Seed-2.0 和 Qwen3-VL-235B。该模型已通过百度智能云千帆平台公开可用。
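The structured layout representation named in the abstract (bounding boxes, element types, reading order) maps naturally onto a small record type. The following is a sketch under assumptions: the `LayoutElement` fields, the element type names, and the Markdown rules are hypothetical, and the model's real think-token output format is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class LayoutElement:
    bbox: tuple   # (x0, y0, x1, y1) in pixels
    kind: str     # element type, e.g. "title" or "paragraph" (assumed names)
    order: int    # reading-order index
    text: str     # recognized content


def to_markdown(elements):
    """Emit Markdown in reading order, using the element type to pick the syntax."""
    lines = []
    for el in sorted(elements, key=lambda e: e.order):
        if el.kind == "title":
            lines.append(f"# {el.text}")
        else:
            lines.append(el.text)
    return "\n\n".join(lines)


# Elements listed in detection order, which differs from reading order.
layout = [
    LayoutElement(bbox=(40, 120, 560, 300), kind="paragraph", order=1, text="First paragraph."),
    LayoutElement(bbox=(40, 40, 560, 90), kind="title", order=0, text="Report"),
]
markdown = to_markdown(layout)
```

Producing the layout first and the text second is what lets the final output follow reading order even when detections arrive in a different order.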
Grounding World Simulation Models in a Real-World Metropolis
NAVER AI Lab
在真实都市中锚定世界模拟模型
What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse street-view images. We further introduce a Virtual Lookahead Sink to stabilize long-horizon generation by continuously re-grounding each chunk to a retrieved image at a future location. We evaluate SWM against recent video world models across three cities: Seoul, Busan, and Ann Arbor. SWM outperforms existing methods in generating spatially faithful, temporally consistent, long-horizon videos grounded in actual urban environments over trajectories reaching hundreds of meters, while supporting diverse camera movements and text-prompted scenario variations.
如果一个世界模拟模型渲染的不是想象中的环境,而是一座真实存在的城市,会怎样?以往生成式世界模型通常通过“想象”全部内容来合成视觉上合理但并不真实的环境。我们提出 Seoul World Model(SWM),一个锚定于真实首尔城市尺度的世界模型。SWM 通过基于检索的条件输入,将附近街景图像引入自回归视频生成过程。然而,这一设计也带来了若干挑战,包括检索参考与动态目标场景之间的时间错位、由稀疏车载采集导致的轨迹多样性不足与数据稀疏。我们通过跨时间配对、大规模合成数据集以及从稀疏街景图像合成连贯训练视频的视图插值管线来解决这些问题。进一步地,我们提出 Virtual Lookahead Sink,在长程生成中持续将每个片段重新锚定到未来位置检索到的图像,从而稳定生成过程。我们在首尔、釜山和安娜堡三座城市上将 SWM 与近期视频世界模型进行比较,结果表明,SWM 在真实城市环境中沿着数百米轨迹生成空间忠实、时间一致、长程视频方面显著优于现有方法,同时还支持多样化相机运动和文本提示场景变化。
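Retrieval of nearby street-view images, and re-grounding each chunk to an image at a future location, can be sketched with a nearest-neighbor lookup over coordinates. This is an assumption-laden illustration: the record layout, the `lookahead` parameter, and the use of great-circle distance as the retrieval criterion are hypothetical and not taken from the paper.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def retrieve_nearest(position, street_views):
    """Return the street-view record captured closest to `position`."""
    return min(street_views, key=lambda sv: haversine_m(position, sv["pos"]))


def lookahead_reference(trajectory, chunk_end, lookahead, street_views):
    """Pick the reference image nearest to a pose `lookahead` steps past the chunk."""
    idx = min(chunk_end + lookahead, len(trajectory) - 1)
    return retrieve_nearest(trajectory[idx], street_views)


street_views = [
    {"id": "sv_south", "pos": (37.5665, 126.9780)},
    {"id": "sv_north", "pos": (37.5700, 126.9780)},
]
# A camera path heading north between the two capture points.
trajectory = [(37.5665, 126.9780), (37.5675, 126.9780),
              (37.5690, 126.9780), (37.5699, 126.9780)]
current = retrieve_nearest(trajectory[0], street_views)
ahead = lookahead_reference(trajectory, chunk_end=1, lookahead=2, street_views=street_views)
```

The chunk ending at index 1 is still closest to the southern capture, but the lookahead reference is already the northern one, so generation is pulled toward where the camera is going rather than where it has been.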
OpenClaw-RL: Train Any Agent Simply by Talking
Princeton AI Lab
OpenClaw-RL:只需对话即可训练任意智能体
Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and a policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards.
每一次智能体交互都会产生下一状态信号,例如用户回复、工具输出、终端或 GUI 状态变化,但现有 agentic RL 系统都没有把这些信号作为在线学习来源加以利用。我们提出 OpenClaw-RL,一个建立在简单观察之上的框架:下一状态信号是普适的,策略可以同时从所有这类信号中学习。无论是个人对话、终端执行、GUI 交互、SWE 任务还是工具调用轨迹,它们本质上都不是彼此独立的训练问题,而是同一种交互闭环中可被共同利用的学习数据。下一状态信号包含两类信息:一类是评估性信号,用于反映动作表现如何,并由 PRM judge 提取为标量奖励;另一类是指令性信号,用于说明动作本应如何不同,并通过 Hindsight-Guided On-Policy Distillation(OPD)恢复出来。我们从下一状态中提取文本提示,构造增强教师上下文,并提供比任何标量奖励都更丰富的 token 级方向性优势监督。由于采用异步设计,模型在服务在线请求的同时,PRM 持续评判交互,训练器也同步更新策略,三者之间几乎没有协调开销。用于个人智能体时,OpenClaw-RL 让智能体在被使用的过程中自然进步,从用户追问、纠正和显式反馈中回收会话信号;用于通用智能体时,同一套基础设施还能扩展到终端、GUI、SWE 与工具调用场景中的大规模 RL 训练。
HSImul3R: Physics-in-the-Loop Reconstruction of Simulation-Ready Human-Scene Interactions
MMLab@NTU
HSImul3R:面向可仿真的人场景交互物理闭环重建
We present HSImul3R, a unified framework for simulation-ready 3D reconstruction of human-scene interactions (HSI) from casual captures, including sparse-view images and monocular videos. Existing methods suffer from a perception-simulation gap: visually plausible reconstructions often violate physical constraints, leading to instability in physics engines and failure in embodied AI applications. To bridge this gap, we introduce a physically-grounded bi-directional optimization pipeline that treats the physics simulator as an active supervisor to jointly refine human dynamics and scene geometry. In the forward direction, we employ Scene-targeted Reinforcement Learning to optimize human motion under dual supervision of motion fidelity and contact stability. In the reverse direction, we propose Direct Simulation Reward Optimization, which leverages simulation feedback on gravitational stability and interaction success to refine scene geometry. We further present HSIBench, a new benchmark with diverse objects and interaction scenarios. Extensive experiments demonstrate that HSImul3R produces the first stable, simulation-ready HSI reconstructions and can be directly deployed to real-world humanoid robots.
我们提出 HSImul3R,一个统一框架,用于从日常采集数据(包括稀疏视角图像和单目视频)中重建可直接用于仿真的 3D 人-场景交互(HSI)。现有方法通常存在“感知-仿真鸿沟”:虽然重建结果在视觉上看似合理,但往往违反物理约束,导致在物理引擎中不稳定,并在具身 AI 应用中失效。为弥合这一鸿沟,我们引入一个具备物理约束的双向优化流程,将物理仿真器视作主动监督信号,共同优化人体动力学与场景几何。在正向过程中,我们采用 Scene-targeted Reinforcement Learning,在运动保真度和接触稳定性的双重监督下优化人体动作;在反向过程中,我们提出 Direct Simulation Reward Optimization,利用仿真反馈中的重力稳定性和交互成功信号来优化场景几何。我们还提出了 HSIBench,一个覆盖多样物体与交互场景的新基准。大量实验表明,HSImul3R 生成了首批稳定、可仿真的 HSI 重建结果,并可直接部署到真实人形机器人上。
OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Fudan University
OmniLottie:通过参数化 Lottie Token 生成矢量动画
OmniLottie is a versatile framework that generates high-quality vector animations from multimodal instructions. For flexible motion and visual content control, we focus on Lottie, a lightweight JSON format for representing both shapes and animation behaviors. However, raw Lottie JSON files contain extensive invariant structural metadata and formatting tokens, posing significant challenges for learning vector animation generation. Therefore, we introduce a well-designed Lottie tokenizer that transforms JSON files into structured sequences of commands and parameters representing shapes, animation functions, and control parameters. This tokenizer enables us to build OmniLottie upon pretrained vision-language models to follow multimodal interleaved instructions and generate high-quality vector animations. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. With extensive experiments, we validate that OmniLottie can produce vivid and semantically aligned vector animations that adhere closely to multimodal human instructions.
OmniLottie 是一个能够根据多模态指令生成高质量矢量动画的通用框架。为了实现灵活的运动与视觉内容控制,我们聚焦于 Lottie,这是一种用于表示图形形状和动画行为的轻量级 JSON 格式。然而,原始 Lottie JSON 文件包含大量不随内容变化的结构元数据和格式化 token,这给矢量动画生成学习带来了明显挑战。因此,我们设计了一种精心构造的 Lottie tokenizer,将 JSON 文件转化为由命令和参数组成的结构化序列,用于表示形状、动画函数和控制参数。这个 tokenizer 使我们能够在预训练视觉语言模型之上构建 OmniLottie,从而遵循多模态交织指令生成高质量矢量动画。为进一步推动该方向研究,我们构建了 MMLottie-2M,一个大规模数据集,包含专业设计的矢量动画及其文本和视觉标注。大量实验表明,OmniLottie 能生成生动、语义一致并且高度贴合多模态人类指令的矢量动画。
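The idea of turning Lottie JSON into a sequence of commands and parameters can be illustrated on a tiny document. This sketch is not the OmniLottie tokenizer: the token names such as `<canvas>` and `<layer>` are invented, only static transform properties are handled, and fields such as the version string, in/out points, and layer names are dropped for brevity.

```python
def tokenize_lottie(doc):
    """Flatten a simplified Lottie document into command and parameter tokens."""
    tokens = ["<canvas>", doc["w"], doc["h"], "<fps>", doc["fr"]]
    for layer in doc.get("layers", []):
        tokens += ["<layer>", layer["ty"]]
        # "ks" holds the layer transform; each property stores its value under "k".
        # Keyframed properties (where "a" is 1) are not handled in this sketch.
        for name, prop in sorted(layer.get("ks", {}).items()):
            value = prop["k"]
            values = value if isinstance(value, list) else [value]
            tokens += [f"<{name}>", *values]
        tokens.append("</layer>")
    return tokens


doc = {
    "v": "5.7.0", "fr": 30, "ip": 0, "op": 60, "w": 512, "h": 512,
    "layers": [
        {"ty": 4, "nm": "circle",
         "ks": {"o": {"a": 0, "k": 100}, "p": {"a": 0, "k": [256, 256, 0]}}},
    ],
}
tokens = tokenize_lottie(doc)
print(tokens)
```

The nested braces, quotes, and repeated key names of the JSON disappear, leaving a short sequence in which each command token is followed by its numeric parameters.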