LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer
ICLR 2026 | GitHub Stars 1.2k | Diffusion
LucidFlux:通过大规模扩散 Transformer 实现免描述的照片级真实图像修复
Abstract
Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to anchor geometry and suppress artifacts, respectively. A timestep- and layer-adaptive modulation schedule then routes these cues across the backbone's hierarchy, yielding coarse-to-fine, context-aware updates that protect global structure while recovering texture. To avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free image restoration in the wild.
图像修复(IR)旨在恢复由未知混合退化造成的图像,同时保持语义信息;在这类条件下,判别式修复器和基于 UNet 的扩散先验往往会过度平滑、产生幻觉或发生语义漂移。我们提出 LucidFlux,这是一种无需图像描述的 IR 框架,可在不使用图像 caption 的情况下适配大规模扩散 Transformer(Flux.1)。LucidFlux 引入轻量级双分支条件器,分别注入来自退化输入和轻量修复 proxy 的信号,以锚定几何结构并抑制伪影。随后,我们设计了时间步与层级自适应的调制策略,在骨干网络层级中路由这些线索,从而产生由粗到细、上下文感知的更新,在保护全局结构的同时恢复纹理。之后,为避免文本提示或视觉语言模型(VLM)caption 带来的延迟与不稳定性,我们通过从 proxy 中提取的 SigLIP 特征实现免 caption 的语义对齐。一个可扩展的数据筛选流程进一步过滤大规模数据,以获得结构信息丰富的监督。在合成和真实场景基准上,LucidFlux 持续优于强大的开源和商业基线,消融实验也验证了各组件的必要性。LucidFlux 表明,对于大规模 DiT 而言,何时、何处以及用什么进行条件化,而不是增加参数或依赖文本提示,才是在真实场景中实现鲁棒、免 caption 图像修复的关键杠杆。
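As a rough illustration of what a timestep- and layer-adaptive modulation schedule could look like, the sketch below gates two condition signals with a sigmoid. The function names, the sigmoid form, and the choice of which branch dominates at which step are assumptions made for illustration only; the abstract does not specify the schedule LucidFlux actually uses.

```python
import math


def modulation_weights(t, layer, num_layers, sharpness=4.0):
    """Return (w_degraded, w_proxy) for timestep t in [0, 1] and a layer index.

    Hypothetical schedule: t = 1 is the noisiest step. Noisy steps and shallow
    layers lean on the degraded input; late steps and deep layers lean on the
    restored proxy.
    """
    depth = layer / max(num_layers - 1, 1)         # 0 = first layer, 1 = last
    score = sharpness * ((1.0 - t) + depth - 1.0)  # in [-sharpness, sharpness]
    w_proxy = 1.0 / (1.0 + math.exp(-score))       # sigmoid gate
    return 1.0 - w_proxy, w_proxy


def modulate(hidden, cond_degraded, cond_proxy, t, layer, num_layers):
    """Add both condition signals to a hidden vector with adaptive weights."""
    w_d, w_p = modulation_weights(t, layer, num_layers)
    return [h + w_d * d + w_p * p
            for h, d, p in zip(hidden, cond_degraded, cond_proxy)]
```

The point of the sketch is only that the mixing weights depend on both the denoising step and the depth, so the same two cues can act differently across the backbone.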
Precise Object and Effect Removal with Adaptive Target-Aware Attention
CVPR 2026 | GitHub Stars 577 | Diffusion | Inpainting
基于自适应目标感知注意力的精确物体与效应移除
Abstract
Object removal requires eliminating not only the target object but also its associated visual effects, such as shadows and reflections. However, diffusion-based inpainting and removal methods often introduce artifacts, hallucinate content, alter the background, and struggle to remove object effects accurately. To address these challenges, we propose ObjectClear, a novel framework that decouples foreground removal from background reconstruction via an adaptive target-aware attention mechanism. This design enables the model to precisely localize and remove both objects and their effects while maintaining high background fidelity. Moreover, the learned attention maps are leveraged for an attention-guided fusion strategy during inference, further enhancing visual consistency. To facilitate training and evaluation, we construct OBER, a large-scale dataset for OBject-Effect Removal, which provides paired images with and without object effects, along with precise masks for both objects and their effects. The dataset comprises high-quality captured and simulated data, covering diverse objects, effects, and complex multi-object scenes. Extensive experiments demonstrate that ObjectClear outperforms prior methods, achieving superior object-effect removal quality and background fidelity, especially in challenging scenarios.
物体移除不仅需要消除目标物体,还需要消除与其相关的视觉效应,例如阴影和反射。然而,基于扩散的图像补全与移除方法常常会引入伪影、生成幻觉内容、改变背景,并且难以准确移除物体效应。为应对这些挑战,我们提出 ObjectClear,这是一种新型框架,通过自适应目标感知注意力机制将前景移除与背景重建解耦。该设计使模型能够精确定位并移除物体及其效应,同时保持较高的背景保真度。此外,在推理阶段,我们利用学习到的注意力图进行注意力引导融合,进一步增强视觉一致性。为便于训练和评估,我们构建了 OBER,这是一个面向物体效应移除(OBject-Effect Removal)的大规模数据集,提供带有和不带有物体效应的成对图像,并为物体及其效应提供精确 mask。该数据集由高质量实拍和模拟数据组成,覆盖多样化物体、效应以及复杂多物体场景。大量实验证明,ObjectClear 优于以往方法,尤其在具有挑战性的场景中实现了更好的物体效应移除质量和背景保真度。
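A minimal sketch of attention-guided fusion at inference time, assuming the attention map is a per-pixel value in [0, 1] and that fusion is a simple soft blend between the original image and the generated one. The function name and the linear blend are assumptions; the abstract does not give the exact fusion rule.

```python
def attention_guided_fusion(original, generated, attention):
    """Blend two single-channel images using an attention map as a soft mask.

    Where attention is high (object or effect region), take the generated
    pixel; elsewhere keep the original pixel so the background is untouched.
    """
    fused = []
    for o_row, g_row, a_row in zip(original, generated, attention):
        row = []
        for o, g, a in zip(o_row, g_row, a_row):
            a = min(max(a, 0.0), 1.0)  # clamp to a valid blend weight
            row.append(a * g + (1.0 - a) * o)
        fused.append(row)
    return fused
```

With zero attention the output equals the input pixel exactly, which is one simple way to keep background fidelity outside the removed region.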
EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing
CVPR 2026 Highlight | GitHub Stars 109 | Diffusion | Video Inpainting
EffectErase:用于高质量效应擦除的联合视频物体移除与插入
Abstract
Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching. An insertion-removal consistency objective then encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
视频物体移除旨在消除动态目标物体及其视觉效应,例如形变、阴影和反射,同时恢复无缝背景。近期基于扩散的视频补全和物体移除方法可以移除物体,但通常难以擦除这些效应并合成一致的背景。除方法局限外,该方向的发展还受到数据集缺乏的限制:目前缺少能够在多样环境中系统捕获常见物体效应、用于训练和评估的综合数据集。为此,我们引入 VOR(Video Object Removal),这是一个大规模数据集,提供多样化的成对视频;每一对视频中,一段包含目标物体及其效应,另一段则不存在该物体和效应,并配有相应物体 mask。VOR 包含来自实拍和合成来源的 6 万对高质量视频,覆盖五类效应、多种物体类别以及复杂动态多物体场景。基于 VOR,我们提出 EffectErase,这是一种效应感知的视频物体移除方法,在互惠学习框架中将视频物体插入视为移除的逆向辅助任务。该模型包含任务感知区域引导,使学习聚焦于受影响区域,并支持灵活的任务切换。随后,我们设计插入-移除一致性目标,以鼓励互补行为,并共享效应区域和结构线索的定位。经过 VOR 训练后,EffectErase 在大量实验中取得优越性能,能够在多种场景下实现高质量的视频物体效应擦除。
HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
CVPR 2026 | GitHub Stars 83 | Diffusion | Reference-Based
HiFi-Inpaint:迈向用于生成细节保持人-商品图像的高保真参考式图像补全
Abstract
Human-product images, which showcase the integration of humans and products, play a vital role in advertising, e-commerce, and digital marketing. The essential challenge of generating such images lies in ensuring the high-fidelity preservation of product details. Among existing paradigms, reference-based inpainting offers a targeted solution by leveraging product reference images to guide the inpainting process. However, limitations remain in three key aspects: the lack of diverse large-scale training data, the struggle of current models to focus on product detail preservation, and the inability of coarse supervision to achieve precise guidance. To address these issues, we propose HiFi-Inpaint, a novel high-fidelity reference-based inpainting framework tailored for generating human-product images. HiFi-Inpaint introduces Shared Enhancement Attention (SEA) to refine fine-grained product features and Detail-Aware Loss (DAL) to enforce precise pixel-level supervision using high-frequency maps. Additionally, we construct a new dataset, HP-Image-40K, with samples curated from self-synthesized data and processed with automatic filtering. Experimental results show that HiFi-Inpaint achieves state-of-the-art performance, delivering detail-preserving human-product images.
人-商品图像展示了人物与商品的融合,在广告、电商和数字营销中发挥着重要作用。生成此类图像的核心挑战在于确保商品细节的高保真保留。在现有范式中,基于参考图的图像补全通过利用商品参考图来指导补全过程,提供了一种有针对性的解决方案。然而,该方向仍存在三个关键限制:缺乏多样化的大规模训练数据;当前模型难以聚焦于商品细节保留;粗粒度监督无法实现精确引导。为解决这些问题,我们提出 HiFi-Inpaint,这是一种面向人-商品图像生成的高保真参考图像补全框架。HiFi-Inpaint 引入共享增强注意力(SEA)来细化商品的细粒度特征,并提出细节感知损失(DAL),利用高频图实现精确的像素级监督。此外,我们构建了新数据集 HP-Image-40K,其样本来自自合成数据,并经过自动过滤处理。实验结果表明,HiFi-Inpaint 达到了最先进性能,能够生成保留细节的人-商品图像。
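To make the idea of pixel-level supervision on high-frequency maps concrete, here is a small sketch. It assumes the high-frequency map is the image minus a 3x3 box blur and that the loss is a masked L1 distance between the two maps; both choices, and all function names, are illustrative assumptions rather than the DAL definition from the paper.

```python
def box_blur(img):
    """3x3 mean filter; border pixels average only the neighbors that exist."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def high_frequency(img):
    """High-frequency map: the image minus its blurred version."""
    blurred = box_blur(img)
    return [[p - b for p, b in zip(row, b_row)] for row, b_row in zip(img, blurred)]


def detail_aware_loss(pred, target, mask):
    """Masked mean absolute difference between high-frequency maps."""
    hp, ht = high_frequency(pred), high_frequency(target)
    num = den = 0.0
    for p_row, t_row, m_row in zip(hp, ht, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            num += m * abs(p - t)
            den += m
    return num / den if den else 0.0
```

A loss of this shape ignores uniform brightness offsets and reacts only to differences in fine detail inside the masked (product) region.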
FAPE-IR: Frequency-Aware Planning and Execution Framework for All-in-One Image Restoration
CVPR 2026 | GitHub Stars 24 | AIOR | MoE
FAPE-IR:用于一体化图像修复的频率感知规划与执行框架
Abstract
All-in-One Image Restoration (AIO-IR) aims to develop a unified model that can handle multiple degradations under complex conditions. However, existing methods often rely on task-specific designs or latent routing strategies, making it hard to adapt to real-world scenarios with various degradations. We propose FAPE-IR, a Frequency-Aware Planning and Execution framework for image restoration. It uses a frozen Multimodal Large Language Model (MLLM) as a planner to analyze degraded images and generate concise, frequency-aware restoration plans. These plans guide a LoRA-based Mixture-of-Experts (LoRA-MoE) module within a diffusion-based executor, which dynamically selects high- or low-frequency experts, complemented by frequency features of the input image. To further improve restoration quality and reduce artifacts, we introduce adversarial training and a frequency regularization loss. By coupling semantic planning with frequency-based restoration, FAPE-IR offers a unified and interpretable solution for all-in-one image restoration. Extensive experiments show that FAPE-IR achieves state-of-the-art performance across seven restoration tasks and exhibits strong zero-shot generalization under mixed degradations.
一体化图像修复(AIO-IR)旨在开发能够在复杂条件下处理多种退化的统一模型。然而,现有方法通常依赖任务特定设计或潜在路由策略,使其难以适应包含多样退化的真实场景。我们提出 FAPE-IR,一个用于图像修复的频率感知规划与执行框架。该方法使用冻结的多模态大语言模型(MLLM)作为规划器,分析退化图像并生成简洁的频率感知修复计划。这些计划用于指导扩散执行器中的基于 LoRA 的混合专家(LoRA-MoE)模块,使其结合输入图像的频率特征,动态选择高频或低频专家。为进一步提升修复质量并减少伪影,我们引入对抗训练和频率正则化损失。通过将语义规划与基于频率的修复耦合,FAPE-IR 为一体化图像修复提供了统一且可解释的解决方案。大量实验表明,FAPE-IR 在七项修复任务上达到最先进性能,并在混合退化下展现出强大的零样本泛化能力。
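The routing idea (a plan plus input frequency features choosing between high- and low-frequency LoRA experts) can be sketched in a few lines. Everything concrete here is an assumption for illustration: the plan is reduced to a 0/1 flag, the frequency feature is a crude adjacent-difference energy ratio, and the gate is an equal-weight average of the two. The actual FAPE-IR router is not described at this level in the abstract.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_update(x, down, up):
    """Low-rank update up @ (down @ x), the usual LoRA form."""
    return matvec(up, matvec(down, x))


def high_freq_ratio(signal):
    """Share of energy in adjacent differences, in [0, 1). Crude proxy."""
    diff = sum((b - a) ** 2 for a, b in zip(signal, signal[1:]))
    total = sum(s * s for s in signal)
    return diff / (diff + total) if diff + total else 0.0


def lora_moe_layer(x, base, experts, plan_high, signal):
    """Mix a high- and a low-frequency LoRA expert on top of a frozen base.

    plan_high: hypothetical 0/1 flag distilled from the planner's text plan.
    signal:    hypothetical 1D slice of the input used as a frequency feature.
    """
    gate = 0.5 * plan_high + 0.5 * high_freq_ratio(signal)
    y = matvec(base, x)
    hi = lora_update(x, *experts["high"])
    lo = lora_update(x, *experts["low"])
    return [b + gate * h + (1.0 - gate) * l for b, h, l in zip(y, hi, lo)], gate
```

A soft gate is used here for simplicity; a hard top-1 selection between the two experts would fit the description equally well.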
Restore, Assess, Repeat: A Unified Framework for Iterative Image Restoration
CVPR 2026 | GitHub Stars 11 | AIOR | IQA
恢复、评估、重复:用于迭代式图像修复的统一框架
Abstract
Image restoration aims to recover high-quality images from inputs degraded by various factors, such as adverse weather, blur, or low light. While recent studies have shown remarkable progress across individual or unified restoration tasks, they still suffer from limited generalization and inefficiency when handling unknown or composite degradations. To address these limitations, we propose RAR, a Restore, Assess and Repeat process that integrates Image Quality Assessment (IQA) and Image Restoration (IR) into a unified framework to iteratively and efficiently achieve high-quality image restoration. Specifically, we introduce a restoration process that operates entirely in the latent domain to jointly perform degradation identification, image restoration, and quality verification. The resulting model is fully trainable end to end and allows for an all-in-one assess-and-restore approach that dynamically adapts the restoration process. Also, the tight integration of IQA and IR into a unified model minimizes the latency and information loss that typically arise from keeping the two modules disjoint (e.g., during image and/or text decoding). Extensive experiments show that our approach yields consistent improvements under single, unknown, and composite degradations, thereby establishing a new state of the art.
图像修复旨在从受多种因素退化的输入中恢复高质量图像,例如恶劣天气、模糊或低光照。尽管近期研究在单一或统一修复任务上取得了显著进展,但在处理未知或复合退化时,仍存在泛化能力有限和效率不足的问题。为解决这些限制,我们提出 RAR,即恢复、评估并重复(Restore, Assess and Repeat)流程,将图像质量评估(IQA)和图像修复(IR)整合到统一框架中,以迭代且高效地实现高质量图像修复。具体而言,我们引入一种完全在潜在域中运行的修复流程,联合执行退化识别、图像修复和质量验证。所得模型可以端到端训练,并支持一体化的评估与修复方法,使修复过程能够动态自适应。此外,将 IQA 与 IR 紧密整合到统一模型中,可最大限度减少将两个模块分离时通常产生的延迟和信息损失,例如图像和/或文本解码过程中的损失。大量实验表明,我们的方法在单一、未知和复合退化下均带来稳定提升,从而建立了新的最先进水平。
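The restore, assess, repeat control flow can be written as a short loop. This is a sketch under stated assumptions: `restore` and `assess` are hypothetical callables standing in for the model's latent-domain restoration and quality-assessment heads, and the stopping threshold and round limit are illustrative parameters.

```python
def restore_assess_repeat(latent, restore, assess, quality_target=0.9, max_rounds=5):
    """Iterate restoration until the assessed quality reaches a target.

    assess(latent)               -> (quality score, degradation label)
    restore(latent, degradation) -> new latent
    The latent is never decoded inside the loop, mirroring the idea of
    keeping assessment and restoration in the same latent domain.
    """
    history = []
    for round_idx in range(max_rounds):
        score, degradation = assess(latent)
        history.append((round_idx, degradation, score))
        if score >= quality_target:
            break  # quality verified, stop early
        latent = restore(latent, degradation)
    return latent, history
```

In a toy setting where the latent is simply a count of remaining degradations, each round removes one and the loop stops as soon as the assessed score clears the target.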
Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration
CVPR 2026 | GitHub Stars 17 | SSM | UHD
扫描聚类而非像素:用于高效超高清图像修复的聚类中心范式
Abstract
Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C
超高清(UHD)图像修复正陷入可扩展性危机:现有模型受限于逐像素操作,需要难以承受的计算量。尽管 Mamba 等状态空间模型(SSM)承诺线性复杂度,但其逐像素串行扫描对于包含数百万像素的 UHD 内容而言仍是根本瓶颈。我们提出一个问题:理解图像是否必须处理每个像素?本文提出 C
ShiftLUT: Spatial Shift Enhanced Look-Up Tables for Efficient Image Restoration
CVPR 2026 | GitHub Stars 21 | Efficient
ShiftLUT:用于高效图像修复的空间偏移增强查找表
Abstract
Look-Up Table (LUT) based methods have emerged as a promising direction for efficient image restoration. Recent LUT-based methods focus on improving performance by expanding the receptive field. However, they inevitably introduce extra computational and storage overhead, which hinders their deployment on edge devices. To address this issue, we propose ShiftLUT, a novel framework that attains the largest receptive field among all LUT-based methods while maintaining high efficiency. Our key insight lies in three complementary components. First, a Learnable Spatial Shift (LSS) module is introduced to expand the receptive field by applying learnable, channel-wise spatial offsets to feature maps. Second, we propose an asymmetric dual-branch architecture that allocates more computation to the information-dense branch, substantially reducing inference latency without compromising restoration quality. Finally, we incorporate a feature-level LUT compression strategy called Error-bounded Adaptive Sampling (EAS) to minimize the storage overhead. Compared to the previous state-of-the-art method TinyLUT, ShiftLUT achieves a 3.8
基于查找表(LUT)的方法已成为高效图像修复任务中一个有前景的方向。近期基于 LUT 的方法主要通过扩大感受野来提升性能。然而,这些方法不可避免地引入额外的计算和存储开销,从而阻碍其在边缘设备上的部署。为解决这一问题,我们提出 ShiftLUT,这是一个新型框架,在保持高效率的同时,实现了所有基于 LUT 方法中最大的感受野。我们的关键洞察来自三个互补组件。首先,引入可学习空间偏移模块(LSS),通过对特征图施加可学习的逐通道空间偏移来扩大感受野。其次,我们提出一种非对称双分支架构,将更多计算分配给信息密集分支,在不损害修复质量的情况下显著降低推理延迟。最后,我们引入一种名为误差有界自适应采样(EAS)的特征级 LUT 压缩策略,以最小化存储开销。与此前最先进方法 TinyLUT 相比,ShiftLUT 实现了 3.8
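A minimal sketch of channel-wise spatial shifting, to show why it enlarges the receptive field of a pointwise lookup. It assumes integer offsets (learned offsets rounded to whole pixels) and zero padding; both are simplifications chosen for illustration, and the function names are not from the paper.

```python
def shift_channel(channel, dy, dx):
    """Move a 2D channel by (dy, dx) pixels; uncovered cells become 0."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < h and 0 <= src_x < w:
                out[y][x] = channel[src_y][src_x]
    return out


def spatial_shift(features, offsets):
    """Apply one (dy, dx) offset per channel, rounded to whole pixels."""
    return [shift_channel(ch, round(dy), round(dx))
            for ch, (dy, dx) in zip(features, offsets)]
```

After the shift, reading all channels at a single position gathers values that originally sat at different positions, so a later per-pixel table lookup sees a wider neighborhood at no extra lookup cost.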
FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model
CVPR 2026 | GitHub Stars 32 | Foundation Model | MoE
FoundIR-v2:优化图像修复基础模型的预训练数据混合
Abstract
Recent studies have witnessed significant advances in image restoration foundation models driven by improvements in the scale and quality of pre-training data. In this work, we find that the data mixture proportions from different restoration tasks are also a critical factor directly determining the overall performance of all-in-one image restoration models. To this end, we propose a high-capacity diffusion-based image restoration foundation model, FoundIR-v2, which adopts a data equilibrium scheduling paradigm to dynamically optimize the proportions of mixed training datasets from different tasks. By leveraging the data mixing law, our method ensures a balanced dataset composition, enabling the model to achieve consistent generalization and comprehensive performance across diverse tasks. Furthermore, we introduce an effective Mixture-of-Experts (MoE)-driven scheduler into generative pre-training to flexibly allocate task-adaptive diffusion priors for each restoration task, accounting for the distinct degradation forms and levels exhibited by different tasks. Extensive experiments demonstrate that our method can address over 50 sub-tasks across a broader scope of real-world scenarios and achieves favorable performance against state-of-the-art approaches.
近期研究表明,随着预训练数据规模和质量的提升,图像修复基础模型取得了显著进展。在本文中,我们发现,不同修复任务的数据混合比例也是直接决定一体化图像修复模型整体性能的关键因素。为此,我们提出 FoundIR-v2,这是一种基于扩散的高容量图像修复基础模型,采用数据均衡调度范式来动态优化来自不同任务的混合训练数据比例。通过利用数据混合法则,我们的方法确保了平衡的数据集组成,使模型能够在多样任务上获得稳定泛化能力和综合性能。此外,我们在生成式预训练中引入了一种有效的混合专家(MoE)驱动调度器,以针对不同任务表现出的不同退化形式和程度,为每个修复任务灵活分配任务自适应扩散先验。大量实验证明,我们的方法能够覆盖更广泛真实场景中的 50 多个子任务,并相较最先进方法取得有竞争力的性能。
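One simple way to picture dynamic optimization of data mixture proportions is a multiplicative reweighting step driven by per-task validation loss. The update rule below is an assumption made for illustration (an exponentiated-gradient style step); it is not the data mixing law or the scheduling paradigm used by FoundIR-v2, which the abstract does not spell out.

```python
import math


def rebalance(proportions, task_losses, step=0.5):
    """Shift sampling mass toward tasks whose loss is above the average.

    proportions: dict task -> current sampling probability (sums to 1)
    task_losses: dict task -> recent validation loss for that task
    Returns a new dict of probabilities that also sums to 1.
    """
    mean_loss = sum(task_losses.values()) / len(task_losses)
    raw = {task: p * math.exp(step * (task_losses[task] - mean_loss))
           for task, p in proportions.items()}
    total = sum(raw.values())
    return {task: value / total for task, value in raw.items()}
```

Calling this periodically during training nudges the mixture toward under-served tasks while leaving it unchanged when all tasks have equal loss.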
Beyond Ground-Truth: Leveraging Image Quality Priors for Real-World Image Restoration
CVPR 2026 | GitHub Stars 8 | IQA
超越真值:利用图像质量先验进行真实世界图像修复
Abstract
Real-world image restoration aims to restore high-quality (HQ) images from degraded low-quality (LQ) inputs captured under uncontrolled conditions. Existing methods typically depend on ground-truth (GT) supervision, assuming that GT provides perfect reference quality. However, GT can still contain images with inconsistent perceptual fidelity, causing models to converge to the average quality level of the training data rather than achieving the highest perceptual quality attainable. To address these problems, we propose a novel framework, termed IQPIR, that introduces an Image Quality Prior (IQP), extracted from pre-trained No-Reference Image Quality Assessment (NR-IQA) models, to explicitly guide the restoration process toward perceptually optimal outputs. Our approach synergistically integrates IQP with a learned codebook prior through three key mechanisms: (1) a quality-conditioned Transformer, where NR-IQA-derived scores serve as conditioning signals to steer the predicted representation toward maximal perceptual quality, a design that provides a plug-and-play enhancement compatible with existing restoration architectures without structural modification; (2) a dual-branch codebook structure, which disentangles common and HQ-specific features, ensuring a comprehensive representation of both generic structural information and quality-sensitive attributes; and (3) a discrete representation-based quality optimization strategy, which mitigates over-optimization effects commonly observed in continuous latent spaces. Extensive experiments on real-world image restoration demonstrate that our method not only surpasses cutting-edge methods but also serves as a generalizable quality-guided enhancement strategy for existing methods. The code is available.
真实世界图像修复旨在从无约束条件下采集的退化低质量(LQ)输入中恢复高质量(HQ)图像。现有方法通常依赖真值(GT)监督,并假设 GT 提供完美的参考质量。然而,GT 中仍可能包含感知保真度不一致的图像,导致模型收敛到训练数据的平均质量水平,而非达到可获得的最高感知质量。为解决这些问题,我们提出一种新框架 IQPIR,引入从预训练无参考图像质量评估(NR-IQA)模型中提取的图像质量先验(IQP),显式引导修复过程朝向感知最优输出。我们的方法通过三种关键机制将 IQP 与学习到的 codebook 先验协同整合:(1)质量条件 Transformer,其中 NR-IQA 得分作为条件信号,引导预测表示趋向最大感知质量。该设计可作为即插即用增强模块,与现有修复架构兼容,无需结构修改;(2)双分支 codebook 结构,用于解耦通用特征和 HQ 特定特征,确保对通用结构信息与质量敏感属性进行全面表示;(3)基于离散表示的质量优化策略,用于缓解连续潜在空间中常见的过优化效应。真实世界图像修复上的大量实验表明,我们的方法不仅超越了前沿方法,也可作为适用于现有方法的通用质量引导增强策略。代码已开放。
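For readers unfamiliar with codebook priors, the sketch below shows a nearest-neighbor lookup in two separate codebooks, one meant to hold common features and one meant to hold HQ-specific features. How IQPIR actually combines the two branches is not stated in the abstract; concatenating the two selected codes is an assumption made here for illustration, as are the function names.

```python
def nearest_code(feature, codebook):
    """Return (index, code) of the codebook entry closest in squared L2."""
    def sq_dist(code):
        return sum((f - c) ** 2 for f, c in zip(feature, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]


def dual_branch_quantize(feature, common_codebook, hq_codebook):
    """Quantize one feature against both codebooks.

    Returns the pair of discrete indices and the concatenated codes.
    """
    common_idx, common_code = nearest_code(feature, common_codebook)
    hq_idx, hq_code = nearest_code(feature, hq_codebook)
    return (common_idx, hq_idx), list(common_code) + list(hq_code)
```

Because the output is a pair of discrete indices, any quality-driven search operates over a finite set of codes rather than a continuous latent space.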
LearnIR: Learnable Posterior Sampling for Real-World Image Restoration
ICLR 2026 | GitHub Stars 4 | Diffusion | Zero-Shot
LearnIR:用于真实世界图像修复的可学习后验采样
Abstract
Image restoration in real-world conditions is highly challenging due to heterogeneous degradations such as haze, noise, shadows, and blur. Existing diffusion-based methods remain limited: conditional generation struggles to balance fidelity and realism, inversion-based approaches accumulate errors, and posterior sampling requires a known forward operator that is rarely available. We introduce LearnIR, a learnable diffusion posterior sampling framework that eliminates this dependency by training a lightweight model to directly predict gradient correction distributions, enabling Diffusion Posterior Sampling Correction (DPSC) that maintains consistency with the true image distribution during sampling. In addition, a Dynamic Resolution Module (DRM) dynamically adjusts resolution to preserve global structures in early stages and refine fine textures later, while avoiding the need for a pretrained VAE. Experiments on ISTD, O-HAZE, HazyDet, REVIDE, and our newly constructed FaceShadow dataset show that LearnIR achieves state-of-the-art performance in PSNR, SSIM, and LPIPS.
真实条件下的图像修复极具挑战性,因为图像往往受到雾、噪声、阴影和模糊等异质退化影响。现有基于扩散的方法仍存在限制:条件生成难以平衡保真度与真实性,基于反演的方法会累积误差,而后验采样需要已知的前向算子,但真实场景中往往难以获得。我们提出 LearnIR,一种可学习的扩散后验采样框架,通过训练轻量级模型直接预测梯度校正分布,消除了对已知前向算子的依赖,从而实现 Diffusion Posterior Sampling Correction (DPSC),在采样过程中保持与真实图像分布的一致性。此外,Dynamic Resolution Module (DRM) 动态调整分辨率,以在早期阶段保持全局结构,并在后期细化精细纹理,同时避免对预训练 VAE 的需求。在 ISTD、O-HAZE、HazyDet、REVIDE 以及我们新构建的 FaceShadow 数据集上的实验表明,LearnIR 在 PSNR、SSIM 和 LPIPS 上达到了最先进性能。
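The sampling loop implied by a learned posterior correction can be sketched as follows. Classical diffusion posterior sampling subtracts the gradient of a data-fidelity term that requires a known forward operator; here that gradient is replaced by the output of a learned predictor. `denoise_step` and `predict_correction` are hypothetical callables, and modeling the correction as an independent Gaussian per element is an assumption for illustration.

```python
import random


def sample_with_learned_correction(x, y, denoise_step, predict_correction,
                                   num_steps, step_size=1.0, rng=None):
    """Reverse-diffusion loop with a learned correction instead of a true gradient.

    denoise_step(x, t)          -> x after one unconditional reverse step
    predict_correction(x, y, t) -> (mean, std) lists for the correction
    y is the degraded observation; no forward operator is ever evaluated.
    """
    rng = rng or random.Random(0)
    for t in range(num_steps, 0, -1):
        x = denoise_step(x, t)
        mean, std = predict_correction(x, y, t)
        correction = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        x = [xi - step_size * c for xi, c in zip(x, correction)]
    return x
```

With a zero standard deviation the loop reduces to a deterministic guided sampler, which is convenient for checking the control flow in isolation.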