Donghao Zhou1*,
Jiancheng Huang2*,
Jinbin Bai3,
Jiaze Wang1,
Hao Chen1,
Guangyong Chen4,
Xiaowei Hu5†,
Pheng-Ann Heng1
1CUHK 2SIAT, CAS 3NUS 4Zhejiang Lab 5Shanghai AI Lab
We present MagicTailor for component-controllable personalization, a newly formulated task that aims to reconfigure specific components of visual concepts during personalization.
Abstract
Recent advancements in text-to-image (T2I) diffusion models have enabled the creation of high-quality images from text prompts, but they still struggle to generate images with precise control over specific visual concepts. Existing approaches can replicate a given concept by learning from reference images, yet they lack the flexibility for fine-grained customization of individual components within the concept. In this paper, we introduce component-controllable personalization, a novel task that pushes the boundaries of T2I models by allowing users to reconfigure and personalize specific components of concepts. This task is particularly challenging due to two primary obstacles: semantic pollution, where unwanted visual elements corrupt the personalized concept, and semantic imbalance, which causes disproportionate learning of visual semantics. To overcome these challenges, we design MagicTailor, an innovative framework that leverages Dynamic Masked Degradation (DM-Deg) to dynamically perturb undesired visual semantics and Dual-Stream Balancing (DS-Bal) to establish a balanced learning paradigm for visual semantics. Extensive comparisons, ablations, and analyses demonstrate that MagicTailor not only excels in this challenging task but also holds significant promise for practical applications, paving the way for more nuanced and creative image generation.
- 2024.10: Our code is released! Feel free to contact me if anything is unclear.
- 2024.10: Our paper is available. The code is coming soon!
- Create the conda environment:
conda env create -f environment.yml
- Install other dependencies (here we take CUDA 11.6 as an example):
conda activate magictailor
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
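To quickly verify the install (an optional sanity check, not part of the original steps), you can confirm that PyTorch sees your GPU:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
This should print 1.13.1+cu116 and True on a machine with a working CUDA driver.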
- Clone the Grounded-SAM repository:
cd {PATH-TO-THIS-CODE}
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
- Follow the "Install without Docker" section to set up Grounded-SAM (please make sure that the CUDA version used for this installation matches the one PyTorch was built with).
❗You can skip Steps 3 and 4 if you just want a quick try with the example images we provide.
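If you do set up Grounded-SAM, a quick optional check (not from the original instructions) that the CUDA toolkit on your system matches the version PyTorch was built against is:
nvcc --version  # CUDA toolkit available on your PATH
python -c "import torch; print(torch.version.cuda)"  # CUDA version PyTorch was built with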
Directly use the example images in ./examples, or prepare your own pair:
- Create a folder named {CONCEPT}_{ID}+{COMPONENT}_{ID}, where {CONCEPT} and {COMPONENT} are the category names of the concept and component respectively, and {ID} is a customized index (set it to anything you like) that helps you distinguish different pairs.
- Put the reference images into this folder and rename them as 0_{CONCEPT}_{ID}.jpg and 1_{COMPONENT}_{ID}.jpg for the images of the concept and component respectively (append an image index when there are multiple images, as in the example below).
- Finally, the data will be organized like this (a shell sketch for preparing your own pair follows the example):
person_a+hair_a/
├── 0_person_a0.jpg
├── 0_person_a1.jpg
├── 0_person_a2.jpg
├── 1_hair_a0.jpg
├── 1_hair_a1.jpg
└── 1_hair_a2.jpg
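As a hedged illustration of preparing a pair from scratch (the category names, index, and source image paths below are placeholders; substitute your own), the steps might look like:
mkdir person_b+hair_b
cp /path/to/concept1.jpg person_b+hair_b/0_person_b0.jpg   # concept reference images (prefix 0_)
cp /path/to/concept2.jpg person_b+hair_b/0_person_b1.jpg
cp /path/to/component1.jpg person_b+hair_b/1_hair_b0.jpg   # component reference images (prefix 1_)
cp /path/to/component2.jpg person_b+hair_b/1_hair_b1.jpg
You can then pass this folder to --instance_data_dir for training.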
You can train MagicTailor with default hyperparameters:
python train.py --instance_data_dir {PATH-TO-PREPARED-DATA}
For example:
python train.py --instance_data_dir examples/person_k+hair_c
❗Please check the quality of the masks output by Grounded-SAM to ensure that the model runs correctly.
Alternatively, you can train it with customized hyperparameters, such as:
python train.py \
--instance_data_dir examples/person_k+hair_c \
--phase1_train_steps 200 \
--phase2_train_steps 300 \
--phase1_learning_rate 1e-4 \
--phase2_learning_rate 1e-5 \
--lora_rank 32 \
--alpha 0.5 \
--gamma 32 \
--lambda_preservation 0.2
You can refer to our paper or train.py to understand the meaning of these arguments. Adjusting these hyperparameters can help yield better results.
Moreover, we also provide a detailed training script in scripts/train.sh for research or development purposes, which supports further modification.
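For instance, assuming the script can be run directly once you have edited the paths and hyperparameters inside it (please open the file first; its variables are not listed here), a typical invocation would be:
bash scripts/train.sh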
After training, a model will be saved in outputs/magictailor. Placeholder tokens <v0> and <v1> will be assigned to the concept and component respectively for text-to-image generation.
Then, you can generate images with the saved model, just like:
python inference.py \
--model_path "outputs/magictailor" \
--prompt "<v0> with <v1>" \
--output_path "outputs/inference/result.jpg"
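The placeholder tokens can also be embedded in longer prompts for recontextualization; the prompt and output path below are purely illustrative:
python inference.py \
--model_path "outputs/magictailor" \
--prompt "a photo of <v0> with <v1> on the beach" \
--output_path "outputs/inference/beach.jpg"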
- If you hit the GPU memory limit, consider reducing the number of reference images or moving the momentum denoising U-Net (self.unet_m) to another GPU if available (see the corresponding code in train.py).
- While our default hyperparameters work well in most cases, further tuning of the hyperparameters for a specific training pair is still recommended, which helps to achieve a better trade-off between text alignment and identity fidelity.
If you find our work helpful in your research, please consider citing our paper:
@article{zhou2024magictailor,
title={MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models},
author={Zhou, Donghao and Huang, Jiancheng and Bai, Jinbin and Wang, Jiaze and Chen, Hao and Chen, Guangyong and Hu, Xiaowei and Heng, Pheng-Ann},
journal={arXiv preprint arXiv:2410.13370},
year={2024}
}
Our code is built upon the repositories of diffusers and Break-A-Scene. We thank their authors for their excellent work.