Ruibin Li1,2 | Tao Yang3 | Song Guo4 | Lei Zhang1,2
1The Hong Kong Polytechnic University, 2OPPO Research Institute, 3ByteDance, 4The Hong Kong University of Science and Technology.
⭐ If RORem is helpful to you, please star this repo. Thanks! 🤗
- ✅ RORem Dataset
- ✅ Training Code
- ✅ Inference Code
- ✅ RORem Model, LoRA, Discriminator
- ✅ RORem Diffuser
- ✅ Update Dataset to Hugging Face
- ⬜️ Create Hugging Face Demo
- ⬜️ Simplify Inference Code
git clone https://github.com/leeruibin/RORem.git
cd RORem
conda env create -f environment.yaml
conda activate RORem
Install xformers to speed up training. Note that the xformers version should match your torch version.
pip install xformers==0.0.28.post3
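If you are unsure whether the installed xformers build matches your torch version, a quick sanity check (not part of this repo) is to import both packages and print their versions; python -m xformers.info gives more detail.

```python
# Quick sanity check (not part of this repo): confirm xformers imports cleanly
# against the installed torch build and inspect both versions.
import torch
import xformers

print("torch:", torch.__version__)
print("xformers:", xformers.__version__)
```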
We use wandb to record intermediate states during training, so make sure you complete the following setup:
pip install wandb
wandb login
Enter your WANDB_API_KEY in the shell when prompted, or directly export it as an environment variable, e.g. export WANDB_API_KEY=<your_key>.
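Alternatively, here is a minimal sketch of logging in from Python rather than the shell, assuming WANDB_API_KEY has already been exported:

```python
# Minimal sketch: log in to wandb from Python instead of `wandb login`,
# assuming WANDB_API_KEY is already set in the environment.
import os
import wandb

wandb.login(key=os.environ["WANDB_API_KEY"])
```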
The RORem dataset is now available at LetsThink/RORem_dataset.
| Dataset | Download |
|---|---|
| RORem&RORD | Google cloud (73.15GB) |
| Mulan | Google cloud (3.26GB) |
| Final HR | Google cloud (7.9GB) |
Please note that we employed the SafeStableDiffusionSafetyChecker to filter out inappropriate content, which may result in minor discrepancies between the released samples and those presented in the original paper.
For each dataset, the folder structure is organized as follows:
.
├── source
├── mask
├── GT
└── meta.json
The meta.json file records each triplet as:
{"source":"source/xxx.png","mask":"mask/xxx.png","GT":"GT/xxx.png"}
By passing the absolute path of meta.json, the training script can resolve the path of each triplet.
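For reference, below is a minimal sketch of reading the triplets, assuming meta.json is a JSON list of such entries with paths relative to the dataset root (the actual parsing logic in train_RORem.py may differ):

```python
# Minimal sketch of parsing meta.json; paths are placeholders and the exact
# format handled by train_RORem.py may differ.
import json
import os

meta_path = "/absolute/path/to/meta.json"   # placeholder: absolute path to meta.json
dataset_root = os.path.dirname(meta_path)

with open(meta_path) as f:
    triplets = json.load(f)  # assumed: a list of {"source", "mask", "GT"} entries

for item in triplets:
    source_path = os.path.join(dataset_root, item["source"])
    mask_path = os.path.join(dataset_root, item["mask"])
    gt_path = os.path.join(dataset_root, item["GT"])
```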
| Model Checkpoint | Download |
|---|---|
| RORem | Google cloud |
| RORem-mixed | Google cloud |
| RORem-LCM | Google cloud |
| RORem-Discriminator | Google cloud |
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
from myutils.img_util import dilate_mask

resolution = 512
dilate_size = 20
use_CFG = True
input_path = "xxx.png"       # path to the input image
mask_path = "xxx_mask.png"   # path to the removal mask

pipe_edit = AutoPipelineForInpainting.from_pretrained(
    "LetsThink/RORem",
    torch_dtype=torch.float16,
    variant="fp16"
).to("cuda")

input_image = load_image(input_path).resize((resolution, resolution))
input_mask = load_image(mask_path).resize((resolution, resolution))
if dilate_size != 0:
    input_mask = dilate_mask(input_mask, dilate_size)
height = width = resolution

if not use_CFG:
    prompts = ""
    Removal_result = pipe_edit(
        prompt=prompts,
        height=height,
        width=width,
        image=input_image,
        mask_image=input_mask,
        guidance_scale=1.,
        num_inference_steps=50,  # steps between 15 and 30 also work well
        strength=0.99,  # make sure to use `strength` below 1.0
    ).images[0]
else:
    # we also find that adding these prompts can make the model work even better
    prompts = "4K, high quality, masterpiece, Highly detailed, Sharp focus, Professional, photorealistic, realistic"
    negative_prompts = "low quality, worst, bad proportions, blurry, extra finger, Deformed, disfigured, unclear background"
    Removal_result = pipe_edit(
        prompt=prompts,
        negative_prompt=negative_prompts,
        height=height,
        width=width,
        image=input_image,
        mask_image=input_mask,
        guidance_scale=7.5,  # must be > 1 for the negative prompt (CFG) to take effect; tune as needed
        num_inference_steps=50,  # steps between 15 and 30 also work well
        strength=0.99,  # make sure to use `strength` below 1.0
    ).images[0]

Removal_result.save("output.png")
To run RORem inference, prepare an input image and a mask image, point --RORem_unet to the downloaded RORem UNet checkpoint, and optionally use --dilate_size to dilate the mask, then run:
python inference_RORem.py \
--pretrained_model diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
--RORem_unet xxx \
--image_path xxx.png \
--mask_path xxx_mask.png \
--save_path result/output.png \
--use_CFG true \
--dilate_size 0
Here, we provide two versions of the RORem UNet:
- RORem, which achieves optimal performance at an image resolution of 512x512.
- RORem-mixed, which is trained at mixed resolutions of 512x512 and 1024x1024 and delivers superior performance on images larger than 512x512.
Additionally, we have observed that incorporating content-irrelevant prompts and leveraging Classifier-Free Guidance (CFG) further enhances removal performance, surpassing the results reported in the original paper.
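For reference, the following is a minimal sketch of plugging a downloaded RORem UNet into the SDXL inpainting pipeline via diffusers; the checkpoint path is a placeholder, and it assumes the released UNet is stored in diffusers format:

```python
# Minimal sketch: swap a RORem (or RORem-mixed) UNet into the SDXL inpainting
# pipeline. The checkpoint path is a placeholder and assumed to be in
# diffusers format (saved via UNet2DConditionModel.save_pretrained).
import torch
from diffusers import AutoPipelineForInpainting, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "path/to/RORem_unet", torch_dtype=torch.float16
)

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    unet=unet,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
```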
To run RORem-4S inference, download the RORem-LCM LoRA, point --RORem_unet and --RORem_LoRA to the corresponding checkpoints, and optionally use --dilate_size to dilate the mask, then run:
python inference_RORem_4S.py \
--pretrained_model diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
--RORem_unet xxx \
--RORem_LoRA xxx \
--image_path xxx.png \
--mask_path xxx_mask.png \
--inference_steps 4 \
--save_path result/output.png \
--use_CFG true \
--dilate_size 0
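For reference, below is a minimal sketch of what 4-step inference looks like in diffusers, assuming the RORem UNet and the RORem-LCM LoRA are stored in diffusers-compatible formats (all paths are placeholders; inference_RORem_4S.py remains the reference implementation):

```python
# Minimal sketch of 4-step RORem-LCM inference; checkpoint and image paths
# are placeholders, and the LoRA is assumed to be diffusers-compatible.
import torch
from diffusers import AutoPipelineForInpainting, LCMScheduler, UNet2DConditionModel
from diffusers.utils import load_image

unet = UNet2DConditionModel.from_pretrained("path/to/RORem_unet", torch_dtype=torch.float16)
pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    unet=unet,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("path/to/RORem_LoRA")  # placeholder: RORem-LCM LoRA checkpoint

image = load_image("xxx.png").resize((512, 512))
mask = load_image("xxx_mask.png").resize((512, 512))
result = pipe(
    prompt="",
    image=image,
    mask_image=mask,
    num_inference_steps=4,
    guidance_scale=1.0,  # LCM-distilled models are typically run without CFG
    strength=0.99,
).images[0]
result.save("output_4S.png")
```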
To run the RORem discriminator, download the RORem-Discriminator checkpoint, then run:
python inference_RORem_discrminator.py \
--pretrained_model diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
--RORem_discriminator xxx \
--image_path xxx.png \
--mask_path xxx_mask.png \
--edited_path xxx.png
accelerate launch \
--multi_gpu \
--num_processes 8 \
train_RORem.py \
--train_batch_size 16 \
--output_dir <your_path_to_save_checkpoint> \
--meta_path xxx/Final_open_RORem/meta.json \
--max_train_steps 50000 \
--random_flip \
--resolution 512 \
--pretrained_model_name_or_path diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
--mixed_precision fp16 \
--checkpoints_total_limit 5 \
--checkpointing_steps 5000 \
--learning_rate 5e-5 \
--validation_steps 2000 \
--seed 4 \
--report_to wandb
Using DeepSpeed ZeRO-2 requires less GPU memory:
accelerate launch --config_file config/deepspeed_config.yaml \
--multi_gpu \
--num_processes 8 \
train_RORem.py \
--train_batch_size 16 \
--output_dir <your_path_to_save_checkpoint> \
--meta_path xxx/Final_open_RORem/meta.json \
--max_train_steps 50000 \
--random_flip \
--resolution 512 \
--pretrained_model_name_or_path diffusers/stable-diffusion-xl-1.0-inpainting-0.1 \
--mixed_precision fp16 \
--checkpoints_total_limit 5 \
--checkpointing_steps 5000 \
--learning_rate 5e-5 \
--validation_steps 2000 \
--seed 4 \
--report_to wandb
Or you can directly submit the training shell script:
bash run_train_RORem.sh
accelerate launch \
--multi_gpu \
--num_processes 8 \
train_RORem_lcm.py \
--pretrained_teacher_unet xxx \
--output_dir experiment/RORem_LCM
Or you can directly submit the training shell script:
bash run_train_RORem_LCM.sh
To train the RORem-Discriminator, you should add a "score" field to each triplet, so that meta.json becomes:
[
{"source":"source/xxx.png","mask":"mask/xxx.png","GT":"GT/xxx.png", "score":1},
{"source":"source/xxx.png","mask":"mask/xxx.png","GT":"GT/xxx.png", "score":0},
]
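For reference, here is a minimal sketch of producing such a scored meta.json from an existing one; the file names and the score value are placeholders, as the actual scores come from human or discriminator annotation:

```python
# Minimal sketch: append a "score" field to each triplet before discriminator
# training. File names and label values are placeholders.
import json

with open("meta.json") as f:          # placeholder: the original triplet list
    triplets = json.load(f)

for item in triplets:
    # Placeholder label: 1 for a successful removal, 0 for a failed one.
    item["score"] = 1

with open("meta_with_score.json", "w") as f:
    json.dump(triplets, f, indent=2)
```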
Then you can directly submit the training shell script:
bash run_train_RORem_discriminator.sh
Overview of our training data generation and model training process. In stage 1, we gather 60K training triplets from open-source datasets to train an initial removal model. In stage 2, we apply the trained model to a test set and engage human annotators to select high-quality samples to augment the training set. In stage 3, we train a discriminator using the human feedback data, and employ it to automatically annotate high quality training samples. We iterate stages 2&3 for several rounds, ultimately obtaining over 200K object removal training triplets as well as the trained model.
We invite human annotators to evaluate the success rates of different methods. Furthermore, as we refine our discriminator over training rounds, the success rates estimated by the discriminator become increasingly consistent with those given by human annotators.
This project is released under the Apache 2.0 license.
@inproceedings{li2024RORem,
  title={RORem: Training a Robust Object Remover with Human-in-the-Loop},
  author={Li, Ruibin and Yang, Tao and Guo, Song and Zhang, Lei},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025},
}
This implementation is developed based on the diffusers library and LCM, and utilizes the Stable Diffusion XL inpainting model. We would like to express our gratitude to the open-source community for their valuable contributions.