Less is More: Recursive Reasoning with Tiny Networks

https://huggingface.co/papers/2510.04871

https://arxiv.org/abs/2510.04871

https://github.com/SamsungSAILMontreal/TinyRecursiveModels

https://alexiajm.github.io/2025/09/29/tiny_recursive_models.html

Samsung SAIL Montreal

Hierarchical Reasoning Model (HRM) is a novel approach that uses two small neural networks recursing at different frequencies. This biologically inspired method beats large language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI, despite being trained as a small model (27M parameters) on little data (~1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., DeepSeek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of their parameters.

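The two-loop recursion TRM describes can be sketched as follows. The dimensions, the ReLU MLP, and the exact update rules below are illustrative assumptions, not the authors' implementation (the linked GitHub repository has the real code): a single tiny shared network repeatedly refines a latent reasoning state z, then updates the current answer y.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# One tiny 2-layer network, shared across every recursion step
# (weights are random here; in training they would be learned).
W1 = rng.standard_normal((3 * DIM, DIM)) * 0.05
W2 = rng.standard_normal((DIM, DIM)) * 0.05

def tiny_net(x, y, z):
    """2-layer MLP over the concatenation of question x, answer y, latent z."""
    h = np.maximum(np.concatenate([x, y, z], axis=-1) @ W1, 0.0)  # ReLU
    return h @ W2

def trm_forward(x, y, z, n_latent=6, n_outer=3):
    """TRM-style recursion: refine the latent z several times per outer
    step, then refine the answer y -- all with the same tiny network."""
    for _ in range(n_outer):
        for _ in range(n_latent):
            z = tiny_net(x, y, z)   # inner loop: latent reasoning updates
        y = tiny_net(x, y, z)       # outer step: answer update
    return y, z

x = rng.standard_normal((2, DIM))              # batch of 2 "questions"
y = np.zeros((2, DIM)); z = np.zeros((2, DIM))  # initial answer and latent
y_out, z_out = trm_forward(x, y, z)
print(y_out.shape)  # (2, 64)
```

Because one network handles both updates, the parameter count stays tiny while the effective depth grows with the number of recursion steps.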


Agent Learning via Early Experience

https://huggingface.co/papers/2510.08558

https://arxiv.org/abs/2510.08558



A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies for using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate both strategies across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.

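A toy sketch of how reward-free early-experience data might be collected: the agent executes its own action plus a few alternatives in each state, and the resulting next states become the supervision signal (for implicit world modeling, a model would be trained to predict them; for self-reflection, outcomes of suboptimal actions can be compared). The `env_step` and `policy` functions are hypothetical stand-ins, not the paper's interfaces.

```python
import random

random.seed(0)

# Toy environment: state is an integer, an action shifts it by -1/0/+1.
def env_step(state, action):
    return state + action

def policy(state):
    """Stand-in policy; a real agent would be a language model."""
    return random.choice([-1, 0, 1])

def collect_early_experience(states, k_alt=2):
    """Gather (state, action, next_state) triples from the agent's own
    rollouts -- no reward needed; the future state IS the supervision.
    Alternative actions are also executed so the agent can later reflect
    on how suboptimal choices would have played out."""
    triples = []
    for s in states:
        taken = [policy(s)] + random.sample([-1, 0, 1], k_alt)
        for a in taken:
            triples.append((s, a, env_step(s, a)))
    return triples

data = collect_early_experience(states=[0, 5, 9])
print(len(data))  # 9: 3 states x (1 policy action + 2 alternatives)
```

The key property mirrored here is that every triple is verifiable by construction (the environment produced it), sidestepping the need for a reward function.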


Scaling Latent Reasoning via Looped Language Models

https://huggingface.co/papers/2510.25741

https://arxiv.org/abs/2510.25741

https://ouro-llm.github.io/



Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. The Ouro 1.4B and 2.6B models achieve superior performance, matching the results of SOTA LLMs of up to 12B parameters across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era.

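A minimal sketch of the looped-computation idea, under simplifying assumptions: one shared block is applied for a fixed maximum number of iterations, an exit distribution over depths is scored, and that distribution's entropy is exposed so it can be added to a training objective as a regularizer. Ouro's actual architecture and objective differ; this only illustrates the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, MAX_LOOPS = 32, 4

W = rng.standard_normal((DIM, DIM)) * 0.1   # one shared (looped) block
w_exit = rng.standard_normal(DIM) * 0.1     # per-depth exit scorer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def looplm_forward(h):
    """Apply the same block repeatedly in latent space, collect a hidden
    state and an exit score per depth, then mix the per-depth states by
    the learned exit distribution. The returned entropy is the term an
    entropy-regularized objective would reward, keeping depth allocation
    from collapsing onto a single depth."""
    states, exit_logits = [], []
    for _ in range(MAX_LOOPS):
        h = np.tanh(h @ W)              # latent iteration, shared weights
        states.append(h)
        exit_logits.append(h @ w_exit)
    p_exit = softmax(np.array(exit_logits))            # depth allocation
    entropy = -(p_exit * np.log(p_exit + 1e-9)).sum()  # regularizer term
    h_out = (p_exit[:, None] * np.array(states)).sum(axis=0)
    return h_out, p_exit, entropy

h0 = rng.standard_normal(DIM)
h_out, p_exit, ent = looplm_forward(h0)
print(h_out.shape, p_exit.shape)  # (32,) (4,)
```

Reusing one block across iterations is what lets a 1.4B-parameter looped model spend compute like a much deeper network without holding more weights.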


Diffusion Transformers with Representation Autoencoders

https://huggingface.co/papers/2510.11690

https://arxiv.org/abs/2510.11690

https://rae-dit.github.io/

https://github.com/bytetriper/RAE



Latent generative modeling, where a pretrained autoencoder maps pixels into a latent space for the diffusion process, has become the standard strategy for Diffusion Transformers (DiT); however, the autoencoder component has barely evolved. Most DiTs continue to rely on the original VAE encoder, which introduces several limitations: outdated backbones that compromise architectural simplicity, low-dimensional latent spaces that restrict information capacity, and weak representations that result from purely reconstruction-based training and ultimately limit generative quality. In this work, we explore replacing the VAE with pretrained representation encoders (e.g., DINO, SigLIP, MAE) paired with trained decoders, forming what we term Representation Autoencoders (RAEs). These models provide both high-quality reconstructions and semantically rich latent spaces, while allowing for a scalable transformer-based architecture. Since these latent spaces are typically high-dimensional, a key challenge is enabling diffusion transformers to operate effectively within them. We analyze the sources of this difficulty, propose theoretically motivated solutions, and validate them empirically. Our approach achieves faster convergence without auxiliary representation alignment losses. Using a DiT variant equipped with a lightweight, wide DDT head, we achieve strong image generation results on ImageNet: 1.51 FID at 256×256 (no guidance) and 1.13 at both 256×256 and 512×512 (with guidance). RAE offers clear advantages and should be the new default for diffusion transformer training.

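The RAE pipeline can be sketched as below. The encoder/decoder stubs and the noise schedule are placeholders, not the paper's models; the point illustrated is the data flow: a frozen pretrained representation encoder maps pixels to a high-dimensional semantic latent, diffusion happens in that latent space, and a separately trained decoder maps latents back to pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Representation encoders (e.g. DINO-style) produce high-dimensional
# latents -- far wider than the handful of channels a VAE latent has.
ENC_DIM = 768

def frozen_encoder(imgs):
    """Stub for a frozen pretrained representation encoder:
    pixels -> one ENC_DIM-dim latent per image. No gradients flow here."""
    return rng.standard_normal((imgs.shape[0], ENC_DIM))

def trained_decoder(z):
    """Stub for the decoder trained to reconstruct pixels from latents."""
    return rng.standard_normal((z.shape[0], 3, 256, 256))

def add_noise(z, t):
    """Forward diffusion step taken in the *representation* latent space;
    a diffusion transformer would be trained to predict eps from z_t."""
    eps = rng.standard_normal(z.shape)
    return np.sqrt(1.0 - t) * z + np.sqrt(t) * eps, eps

imgs = np.zeros((2, 3, 256, 256))
z = frozen_encoder(imgs)           # pixels -> semantic latent
z_t, eps = add_noise(z, t=0.5)     # diffusion operates on z, not pixels
recon = trained_decoder(z)         # latent -> pixels
print(z.shape, recon.shape)  # (2, 768) (2, 3, 256, 256)
```

The challenge the paper analyzes is the middle stage: making a diffusion transformer work well when the latent is this wide rather than the low-dimensional space a VAE provides.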


PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

https://huggingface.co/papers/2510.14528

https://arxiv.org/abs/2510.14528

https://github.com/PaddlePaddle/PaddleOCR

Baidu


In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

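A simplified illustration of the NaViT-style dynamic-resolution idea the abstract references: the image's native aspect ratio is kept, each side is rounded to a whole number of patches, and the vision-token count therefore varies with image size, subject to a budget. The patch size, token budget, and rounding policy here are assumed values for illustration, not PaddleOCR-VL's actual configuration.

```python
import math

def dynamic_resolution_tokens(h, w, patch=14, max_tokens=1024):
    """Compute a variable-size vision-token grid for an h x w image:
    round each side up to whole patches, and uniformly downscale the
    grid only if it would exceed the token budget."""
    gh, gw = math.ceil(h / patch), math.ceil(w / patch)
    if gh * gw > max_tokens:
        scale = math.sqrt(max_tokens / (gh * gw))
        gh, gw = max(1, int(gh * scale)), max(1, int(gw * scale))
    return gh, gw, gh * gw  # token grid and total vision-token count

print(dynamic_resolution_tokens(140, 280))  # (10, 20, 200) -- fits budget
print(dynamic_resolution_tokens(896, 672))  # (36, 27, 972) -- downscaled
```

Compared with resizing every page to one fixed square resolution, this keeps small crops cheap and avoids distorting tall or wide document layouts.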


DeepSeek-OCR: Contexts Optical Compression

https://huggingface.co/papers/2510.18234

https://arxiv.org/abs/2510.18234

https://github.com/deepseek-ai/DeepSeek-OCR

DeepSeek


We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10×), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20×, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day on a single A100-40G GPU.

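The compression ratio the abstract reports can be made concrete with a small helper: it is simply how many text tokens each vision token stands in for once a passage is rendered as an image. The precision figures in the comments restate the abstract's reported numbers; they are not outputs of this code.

```python
def optical_compression_ratio(n_text_tokens, n_vision_tokens):
    """Optical compression ratio: text tokens represented per vision token
    when the text is rendered to an image and encoded visually."""
    return n_text_tokens / n_vision_tokens

# Regimes reported in the abstract: ~97% OCR precision below 10x
# compression, and ~60% at 20x.
for text_toks, vis_toks in [(900, 100), (2000, 100)]:
    r = optical_compression_ratio(text_toks, vis_toks)
    regime = "~97% precision" if r < 10 else (
        "~60% precision" if r >= 20 else "degrading precision")
    print(f"{r:.0f}x -> {regime}")
```

Framed this way, the OmniDocBench results amount to the same trade: 100 vision tokens doing the work of a competitor's 256 text-equivalent tokens per page.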