RAM++: Robust Representation Learning via Adaptive Mask for All-in-One Image Restoration

Zilong Zhang*,1, Chujie Qin*,1, Chunle Guo1, Yong Zhang2,1, Chao Xue3, Ming-Ming Cheng1, Chongyi Li†,1
1VCIP, CS, Nankai University, 2Chongqing Chang'an Wangjiang Industrial Group Co., Ltd, 3Tiandy Technologies
*Equal contribution; †Corresponding author

Abstract

This work presents Robust Representation Learning via Adaptive Mask (RAM++), a two-stage framework for all-in-one image restoration. RAM++ integrates high-level semantic understanding with low-level texture generation to achieve content-oriented robust restoration, addressing the limitations of existing degradation-oriented methods in extreme scenarios (e.g., when degradations are strongly coupled with image structures). It also mitigates common challenges, such as unbalanced performance across tasks, overfitting to seen degradations, and weak generalization to unseen ones, through three key designs:

  1. Adaptive Semantic-Aware Mask (AdaSAM): a pre-training strategy that applies pixel-level masks to semantically rich and textured regions. This design enables the network to learn both generative priors and image content priors from various degradations.
  2. Mask Attribute Conductance (MAC): a selective fine-tuning strategy that updates only the layers contributing most to bridging the integrity gap between masked pre-training and full-image fine-tuning, while retaining the learned priors.
  3. Robust Feature Regularization (RFR): a strategy that leverages DINOv2’s semantically consistent and degradation-invariant representations, together with efficient feature fusion, to achieve faithful and semantically coherent restoration.

With these designs, RAM++ achieves robust, well-balanced, and state-of-the-art performance across seen, unseen, extreme, and mixed degradations.

Method Overview

[Figure: overall pipeline of RAM++]

An illustration of our overall pipeline. 1) Pre-training: the model is trained with the Adaptive Semantic-Aware Mask (AdaSAM) strategy tailored to low-level vision. We mask the degraded images’ semantically and texturally rich regions (i.e., high-information regions) at the pixel level with a 50% masking ratio and reconstruct the clean images. 2) Fine-tuning: we bridge the input integrity gap that arises when transitioning from masked inputs during pre-training to full images during inference. Using the proposed MAC, we assess each network layer’s contribution to closing this gap, rank the layers in descending order, and fine-tune only the top k% of layers on complete images. 3) The fine-tuning process is further assisted by a pre-trained vision foundation model, which provides semantically consistent and degradation-invariant priors.
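
To make the AdaSAM step concrete, below is a minimal PyTorch sketch of pixel-level masking of high-information regions at a 50% ratio. The Sobel gradient-magnitude score and the function name `adasam_mask` are our own illustrative assumptions; the paper's actual semantic/texture scoring may differ.

```python
# Minimal sketch of AdaSAM-style pixel-level masking. Assumption: the
# "high-information" score is approximated by Sobel gradient magnitude.
import torch
import torch.nn.functional as F

def adasam_mask(degraded: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Return a (B, 1, H, W) mask that zeros out the top `ratio` most
    textured pixels of a (B, C, H, W) batch of degraded images."""
    gray = degraded.mean(dim=1, keepdim=True)                 # (B, 1, H, W)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=degraded.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)                                   # Sobel y kernel
    grad = F.conv2d(gray, kx, padding=1) ** 2 + F.conv2d(gray, ky, padding=1) ** 2
    score = grad.flatten(1)                                   # (B, H*W)
    k = int(ratio * score.shape[1])
    top = score.topk(k, dim=1).indices                        # richest pixels
    mask = torch.ones_like(score).scatter_(1, top, 0.0)       # 0 = masked
    return mask.view_as(gray)

# Pre-training pair: input = degraded * adasam_mask(degraded), target = clean.
```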
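The MAC-based selection in stage 2 can be read as a rank-and-freeze step. In the sketch below, the per-layer conductance scores are assumed to be precomputed (the paper derives them from the transition between masked and full inputs); `select_layers_for_finetuning` and `top_percent` are hypothetical names for illustration.

```python
# Minimal sketch of MAC-style selective fine-tuning. Assumption: `conductance`
# maps each layer name to its precomputed Mask Attribute Conductance score.
import torch.nn as nn
from typing import Dict, List

def select_layers_for_finetuning(model: nn.Module,
                                 conductance: Dict[str, float],
                                 top_percent: float = 30.0) -> List[str]:
    """Unfreeze only the top `top_percent`% of layers ranked by conductance."""
    ranked = sorted(conductance, key=conductance.get, reverse=True)
    k = max(1, round(len(ranked) * top_percent / 100))
    chosen = set(ranked[:k])
    for name, param in model.named_parameters():
        layer = name.rsplit(".", 1)[0]            # drop ".weight" / ".bias"
        param.requires_grad = layer in chosen     # freeze everything else
    return ranked[:k]
```

Fine-tuning on complete images then updates only the selected layers, which is what preserves the priors learned during masked pre-training.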
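Finally, the foundation-model assistance in stage 3 can be sketched as a frozen-DINOv2 alignment loss. The cosine objective, the `dinov2_vits14` checkpoint, and the resizing choices below are our assumptions standing in for the paper's efficient feature fusion.

```python
# Minimal sketch of an RFR-style regularizer. Assumptions: a frozen DINOv2
# ViT-S/14 from torch.hub; cosine alignment between the restored output's and
# the clean target's embeddings stands in for the paper's feature fusion.
import torch
import torch.nn.functional as F

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
for p in dinov2.parameters():
    p.requires_grad = False                        # frozen foundation model

def rfr_loss(restored: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """Penalize semantic drift between restored and clean images."""
    # DINOv2 expects sides divisible by 14; inputs assumed ImageNet-normalized.
    restored = F.interpolate(restored, size=(224, 224), mode="bilinear",
                             align_corners=False)
    clean = F.interpolate(clean, size=(224, 224), mode="bilinear",
                          align_corners=False)
    with torch.no_grad():
        target = dinov2(clean)                     # (B, D) CLS embedding
    pred = dinov2(restored)                        # gradients flow to restorer
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```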

Visual Comparison

[Visual comparison figures]

Quantitative Comparison

[Table: quantitative comparison results]

BibTeX

@misc{zhang2025ramrobustrepresentationlearning,
      title={RAM++: Robust Representation Learning via Adaptive Mask for All-in-One Image Restoration}, 
      author={Zilong Zhang and Chujie Qin and Chunle Guo and Yong Zhang and Chao Xue and Ming-Ming Cheng and Chongyi Li},
      year={2025},
      eprint={2509.12039},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.12039}, 
}