
RORem: Training a Robust Object Remover with Human-in-the-Loop

Ruibin Li1,2 | Tao Yang3 | Song Guo4 | Lei Zhang1,2

1The Hong Kong Polytechnic University, 2OPPO Research Institute, 3ByteDance, 4The Hong Kong University of Science and Technology.

⭐: If RORem is helpful to you, please help star this repo. Thanks! 🤗

📌 Progress Checklist

  • ✅ RORem Dataset
  • ✅ Training Code
  • ✅ Inference Code
  • ✅ RORem Model, LoRA, Discriminator
  • ✅ RORem Diffuser
  • ✅ Update Dataset to Hugging Face
  • ⬜️ Create Hugging Face Demo
  • ⬜️ Simplify Inference Code

😃 Prepare Environment

git clone https://github.com/leeruibin/RORem.git
cd RORem
conda env create -f environment.yaml
conda activate RORem

Install xformers to speed up training; note that the xformers version must match the installed torch version.

pip install xformers==0.0.28.post3
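
A quick sanity check (a minimal sketch) that the two packages load together; a mismatched pair usually fails or warns at import time:

import torch
import xformers

# Print both versions to confirm the pairing.
print("torch:", torch.__version__)
print("xformers:", xformers.__version__)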

We use wandb to record intermediate states during training, so make sure you complete the following setup:

pip install wandb
wandb login

Enter the WANDB_API_KEY when prompted in the shell, or export it directly as an environment variable via export WANDB_API_KEY=.
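
If you prefer to log in programmatically, a minimal sketch (assuming the key is already exported in the environment):

import os
import wandb

# Reads the key from the environment instead of the interactive prompt.
wandb.login(key=os.environ["WANDB_API_KEY"])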

⭐ Download RORem Dataset from Hugging Face

The RORem dataset is now available at LetsThink/RORem_dataset.

⭐ Download RORem Dataset from Google Cloud

Dataset       Download
RORem&RORD    Google cloud (73.15 GB)
Mulan         Google cloud (3.26 GB)
Final HR      Google cloud (7.9 GB)

Please note that we employed the SafeStableDiffusionSafetyChecker to filter out inappropriate content, which may result in minor discrepancies between the final image-text pairs and those presented in the original paper.
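
For reference, a rough sketch of this kind of filtering with the safety checker from diffusers (the checker repo id, CLIP processor id, and wiring here are assumptions for illustration, not the script used to build the dataset):

import numpy as np
from PIL import Image
from transformers import CLIPImageProcessor
from diffusers.pipelines.stable_diffusion_safe.safety_checker import SafeStableDiffusionSafetyChecker

checker = SafeStableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("GT/xxx.png").convert("RGB")
clip_input = processor(images=image, return_tensors="pt").pixel_values
# The checker flags NSFW content; flagged samples are dropped from the dataset.
_, has_nsfw = checker(clip_input=clip_input, images=[np.array(image)])
keep = not has_nsfw[0]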

For each dataset, the folder structure is organized as:

.
β”œβ”€β”€ source
β”œβ”€β”€ mask
β”œβ”€β”€ GT
└── meta.json  # records the (source, mask, GT) triplets

The meta.json file records each triplet as:

{"source":"source/xxx.png","mask":"mask/xxx.png","GT":"GT/xxx.png"}

By passing the absolute path of meta.json to the training script, it can parse the paths of each triplet.
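
For example, a minimal sketch of loading the triplets (this assumes meta.json holds a JSON list, matching the scored example in the training section below):

import json
import os

meta_path = "xxx/Final_open_RORem/meta.json"  # absolute path to meta.json
root = os.path.dirname(meta_path)

with open(meta_path) as f:
    records = json.load(f)  # [{"source": ..., "mask": ..., "GT": ...}, ...]

# Relative paths in each triplet are resolved against the meta.json folder.
triplets = [
    (os.path.join(root, r["source"]),
     os.path.join(root, r["mask"]),
     os.path.join(root, r["GT"]))
    for r in records
]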

🔥 Inference

Model checkpoint      Download
RORem                 Google cloud
RORem-mixed           Google cloud
RORem-LCM             Google cloud
RORem-Discriminator   Google cloud

RORem Diffuser

import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
from myutils.img_util import dilate_mask

resolution = 512
dilate_size = 20
use_CFG = True
input_path = "xxx.png"        # path to the input image
mask_path = "xxx_mask.png"    # path to the mask image

pipe_edit = AutoPipelineForInpainting.from_pretrained(
        "LetsThink/RORem",
        torch_dtype=torch.float16,
        variant="fp16"
    ).to("cuda")
input_image = load_image(input_path).resize((resolution, resolution))
input_mask = load_image(mask_path).resize((resolution, resolution))
if dilate_size != 0:
    input_mask = dilate_mask(input_mask, dilate_size)
height = width = resolution
if not use_CFG:
    prompts = ""
    Removal_result = pipe_edit(
            prompt=prompts,
            height=height,
            width=width,
            image=input_image,
            mask_image=input_mask,
            guidance_scale=1.,
            num_inference_steps=50,  # steps between 15 and 30 also work well
            strength=0.99,  # make sure to use `strength` below 1.0
        ).images[0]
else:
    # we also find that adding these prompts makes the model work even better
    prompts = "4K, high quality, masterpiece, Highly detailed, Sharp focus, Professional, photorealistic, realistic"
    negative_prompts = "low quality, worst, bad proportions, blurry, extra finger, Deformed, disfigured, unclear background"
    Removal_result = pipe_edit(
            prompt=prompts,
            negative_prompt=negative_prompts,
            height=height,
            width=width,
            image=input_image,
            mask_image=input_mask,
            guidance_scale=7.5,  # CFG only takes effect with guidance_scale > 1; 7.5 is a common default
            num_inference_steps=50,  # steps between 15 and 30 also work well
            strength=0.99,  # make sure to use `strength` below 1.0
        ).images[0]
Removal_result.save("result/output.png")
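
dilate_mask is provided in myutils/img_util.py; for reference, a possible implementation based on OpenCV (the repo's own version may differ):

import cv2
import numpy as np
from PIL import Image

def dilate_mask(mask: Image.Image, dilate_size: int) -> Image.Image:
    # Expand the mask outward by roughly dilate_size pixels so the
    # inpainting region fully covers the object boundary.
    kernel = np.ones((dilate_size, dilate_size), np.uint8)
    dilated = cv2.dilate(np.array(mask.convert("L")), kernel, iterations=1)
    return Image.fromarray(dilated)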


Run RORem

To run RORem inference, prepare an input image and a mask image, then run:

# --RORem_unet: path to the RORem UNet checkpoint
# --dilate_size: optionally dilate the mask before inpainting
python inference_RORem.py \
    --pretrained_model diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
    --RORem_unet xxx \
    --image_path xxx.png \
    --mask_path xxx_mask.png \
    --save_path result/output.png \
    --use_CFG true \
    --dilate_size 0

Here, we present two versions of RORem UNet:

  • The RORem model, which achieves optimal performance with an image resolution of 512x512.
  • The RORem-mixed model, trained on a mixed resolution of 512x512 and 1024x1024, delivers superior performance when processing images larger than 512x512.

Additionally, we have observed that incorporating content-irrelevant prompts and leveraging Classifier-Free Guidance (CFG) further enhances removal performance, surpassing the results reported in the original paper.

Run RORem-4S

To run RORem-4S inference, download the RORem-LCM LoRA, then run:

# --RORem_unet: path to the RORem UNet checkpoint
# --RORem_LoRA: path to the RORem LoRA checkpoint
# --dilate_size: optionally dilate the mask before inpainting
python inference_RORem_4S.py \
    --pretrained_model diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
    --RORem_unet xxx \
    --RORem_LoRA xxx \
    --image_path xxx.png \
    --mask_path xxx_mask.png \
    --inference_steps 4 \
    --save_path result/output.png \
    --use_CFG true \
    --dilate_size 0
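
Under the hood, 4-step sampling follows the usual diffusers LCM recipe; a hedged sketch (the LoRA path is a placeholder, and this assumes the LoRA is stored in a diffusers-compatible format):

import torch
from diffusers import AutoPipelineForInpainting, LCMScheduler
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "LetsThink/RORem", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
# Swap in the LCM scheduler and attach the distilled LoRA weights.
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("path/to/RORem_LCM_LoRA")

image = load_image("xxx.png").resize((512, 512))
mask = load_image("xxx_mask.png").resize((512, 512))
result = pipe(
    prompt="",
    image=image,
    mask_image=mask,
    num_inference_steps=4,
    guidance_scale=1.0,  # CFG is usually disabled for LCM sampling
    strength=0.99,
).images[0]
result.save("result/output.png")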

Run RORem-discriminator

To run the RORem discriminator, download the RORem-Discriminator checkpoint, then run:

python inference_RORem_discrminator.py \
    --pretrained_model diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
    --RORem_discriminator xxx \
    --image_path xxx.png \
    --mask_path xxx_mask.png \
    --edited_path xxx.png

🔥 Training

To train RORem, use the following training script:

accelerate launch \
    --multi_gpu \
    --num_processes 8 \
    train_RORem.py \
    --train_batch_size 16 \
    --output_dir <your_path_to_save_checkpoint> \
    --meta_path xxx/Final_open_RORem/meta.json \
    --max_train_steps 50000 \
    --random_flip \
    --resolution 512 \
    --pretrained_model_name_or_path diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
    --mixed_precision fp16 \
    --checkpoints_total_limit 5 \
    --checkpointing_steps 5000 \
    --learning_rate 5e-5 \
    --validation_steps 2000 \
    --seed 4 \
    --report_to wandb

Using DeepSpeed ZeRO-2 reduces GPU memory usage:

accelerate launch --config_file config/deepspeed_config.yaml \
    --multi_gpu \
    --num_processes 8 \
    train_RORem.py \
    --train_batch_size 16 \
    --output_dir <your_path_to_save_checkpoint> \
    --meta_path xxx/Final_open_RORem/meta.json \
    --max_train_steps 50000 \
    --random_flip \
    --resolution 512 \
    --pretrained_model_name_or_path diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
    --mixed_precision fp16 \
    --checkpoints_total_limit 5 \
    --checkpointing_steps 5000 \
    --learning_rate 5e-5 \
    --validation_steps 2000 \
    --seed 4 \
    --report_to wandb

Alternatively, you can directly submit the training shell script:

bash run_train_RORem.sh

To train RORem-LCM, use the following training script:

accelerate launch \
    --multi_gpu \
    --num_processes 8 \
    train_RORem_lcm.py \
    --pretrained_teacher_unet xxx \
    --output_dir experiment/RORem_LCM

Alternatively, you can directly submit the training shell script:

bash run_train_RORem_LCM.sh

To train RORem-Discriminator, first add a "score" label to each triplet, so that meta.json looks like:

[
{"source":"source/xxx.png","mask":"mask/xxx.png","GT":"GT/xxx.png", "score":1},
{"source":"source/xxx.png","mask":"mask/xxx.png","GT":"GT/xxx.png", "score":0},
]
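
A hypothetical helper for producing this scored meta.json from human feedback (the accepted set and the convention that score 1 marks a successful removal are assumptions, not the repo's API):

import json

def add_scores(meta_in, meta_out, accepted_gt_paths):
    # accepted_gt_paths: set of "GT" entries judged as successful removals.
    with open(meta_in) as f:
        records = json.load(f)
    for item in records:
        item["score"] = 1 if item["GT"] in accepted_gt_paths else 0
    with open(meta_out, "w") as f:
        json.dump(records, f, indent=2)

add_scores("meta.json", "meta_scored.json", {"GT/xxx.png"})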

Then you can directly submit the training shell script:

bash run_train_RORem_discriminator.sh

🌟 Overview Framework

[Figure: pipeline overview]

Overview of our training data generation and model training process. In stage 1, we gather 60K training triplets from open-source datasets to train an initial removal model. In stage 2, we apply the trained model to a test set and engage human annotators to select high-quality samples to augment the training set. In stage 3, we train a discriminator using the human feedback data, and employ it to automatically annotate high quality training samples. We iterate stages 2&3 for several rounds, ultimately obtaining over 200K object removal training triplets as well as the trained model.

[Figure: dataset statistics]

🌟 Visual Results

Quantitative Comparisons

We invite human annotators to evaluate the success rate of different methods. Furthermore, by refining our discriminator, we can see that the success rates estimated by $D_{\phi}$ closely align with human annotations in the test set (with deviations less than 3% in most cases). This indicates that our trained $D_{\phi}$ effectively mirrors human preferences.

[Figure: quantitative comparison results]

Qualitative Comparisons

[Figure: qualitative comparison results]

License

This project is released under the Apache 2.0 license.

BibTeX

@inproceedings{li2024RORem,
  title={RORem: Training a Robust Object Remover with Human-in-the-Loop},
  author={Li, Ruibin and Yang, Tao and Guo, Song and Zhang, Lei},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025},
}

Acknowledgements

This implementation is developed based on the diffusers library and LCM, and utilizes the Stable Diffusion XL inpainting model. We would like to express our gratitude to the open-source community for their valuable contributions.
