PyTorch implementation of the ACM MM'23 paper: LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation.
In the field of text-to-image generation, recent remarkable progress in Stable Diffusion makes it possible to generate a rich variety of novel photorealistic images. However, current models still face misalignment issues (e.g., problematic spatial relation understanding and numeration failure) in complex natural scenes, which impedes high-faithfulness text-to-image generation. Although recent efforts have been made to improve controllability by providing fine-grained guidance (e.g., sketches and scribbles), this issue has not been fundamentally tackled since users have to provide such guidance information manually. In this work, we strive to synthesize high-fidelity images that are semantically aligned with a given textual prompt without any guidance. Toward this end, we propose a coarse-to-fine paradigm to achieve layout planning and image generation. Concretely, we first generate the coarse-grained layout conditioned on a given textual prompt via in-context learning based on Large Language Models. Afterward, we propose a fine-grained object-interaction diffusion method to synthesize high-faithfulness images conditioned on the prompt and the automatically generated layout. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art models in terms of layout and image generation.
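As a minimal sketch of this coarse-to-fine paradigm (all function names and the stubbed outputs below are illustrative stand-ins, not the repository's actual API):

```python
# Coarse-to-fine sketch: (1) an LLM plans a layout from the prompt,
# (2) a layout-conditioned diffusion model renders the image.
# All names below are illustrative, not the repo's actual API.

def llm_generate_layout(prompt: str):
    """Stage 1 (coarse): elicit a layout -- (phrase, bbox) pairs in
    normalized coordinates -- from an LLM via in-context learning.
    Stubbed here; the real call queries an LLM such as GPT-3.5."""
    return [("a large clock tower", (0.05, 0.05, 0.45, 0.95)),
            ("a small white church", (0.55, 0.30, 0.95, 0.90))]

def layout_conditioned_diffusion(prompt: str, layout):
    """Stage 2 (fine): a GLIGEN-style diffusion model renders an image
    conditioned on both the prompt and the generated layout. Stubbed."""
    return f"<image for {prompt!r} with {len(layout)} grounded objects>"

def text_to_image(prompt: str):
    layout = llm_generate_layout(prompt)
    return layout_conditioned_diffusion(prompt, layout)

print(text_to_image("a large clock tower next to a small white church"))
```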
By filtering, scrutinizing, and sampling captions from COCO 2014, we built a new benchmark called COCO-NSS1K to evaluate the Numerical reasoning and Spatial and Semantic relation understanding of text-to-image generative models.
| Type | #Num | #Avg.bbox | #Avg.Cap.Len | Caption Examples |
|---|---|---|---|---|
| Numerical | 155 | 6.23 | 9.55 | • two old cell phones and a wooden table. • two plates some food and a fork knife and spoon. |
| Spatial | 200 | 5.35 | 10.25 | • a large clock tower next to a small white church. • a bowl with some noodles inside of it. |
| Semantic | 200 | 7.10 | 10.62 | • a train on a track traveling through a countryside. • a living room filled with couches, chairs, TV, and windows. |
| Mixed | 188 | 6.94 | 10.76 | • one motorcycle rider riding going up the mountain, two going down. • a group of three bathtubs sitting next to each other. |
| Null | 200 | 6.17 | 9.62 | • a kitchen scene complete with a dishwasher, sink, and an oven. • a person with a hat and some ski poles. |
| Total | 943 | 6.35 | 10.18 | - |
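For intuition only, captions could be bucketed into these categories by simple lexical cues. The keyword lists below are assumptions made for illustration; the real COCO-NSS1K construction also involved manual scrutiny and sampling:

```python
import re

# Illustrative keyword cues only; not the actual COCO-NSS1K filter.
NUMERICAL = re.compile(r"\b(two|three|four|five|group of)\b")
SPATIAL = re.compile(r"\b(next to|inside|on top of|under|left|right)\b")
SEMANTIC = re.compile(r"\b(riding|filled with|traveling|holding)\b")

def categorize(caption: str) -> str:
    cap = caption.lower()
    num = bool(NUMERICAL.search(cap))
    spa = bool(SPATIAL.search(cap))
    sem = bool(SEMANTIC.search(cap))
    if num + spa + sem > 1:
        return "Mixed"
    if num: return "Numerical"
    if spa: return "Spatial"
    if sem: return "Semantic"
    return "Null"

print(categorize("two old cell phones and a wooden table."))  # Numerical
print(categorize("a bowl with some noodles inside of it."))   # Spatial
```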
1. Download the repo and create the environment

```bash
git clone https://github.com/LayoutLLM-T2I/LayoutLLM-T2I.git
cd LayoutLLM-T2I
conda create -n layoutllm_t2i python=3.8
conda activate layoutllm_t2i
pip install -r requirements.txt
```
2. Download and prepare the pretrained weights
This model includes a policy model and a GLIGEN-based relation-aware diffusion model. The policy weights can be downloaded here and saved in `POLICY_CKPT`. The diffusion model weights can be downloaded here (Baidu) or here (Huggingface) and saved in `DIFFUSION_CKPT`.

Download the candidate example file, in which each instance was randomly sampled from COCO2014, and save it in `CANDIDATE_PATH`. Obtain an `OPENAI_API_KEY` from the OpenAI platform.
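Conceptually, those candidate instances act as in-context demonstrations when prompting the LLM for a layout. A minimal sketch of that call, assuming the legacy openai-python (<1.0) chat interface and an invented prompt format (neither is the repository's exact code):

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def llm_generate_layout(prompt, candidates):
    """Query the LLM for a layout, showing sampled (caption, layout)
    pairs as in-context demonstrations. Illustrative sketch only."""
    demos = "\n\n".join(
        f"Caption: {c['caption']}\nLayout: {c['layout']}" for c in candidates
    )
    messages = [
        {"role": "system",
         "content": "Given a caption, output object phrases with bounding boxes."},
        {"role": "user", "content": f"{demos}\n\nCaption: {prompt}\nLayout:"},
    ]
    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    return resp["choices"][0]["message"]["content"]
```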
Run the generation code:
```bash
export OPENAI_API_KEY=OPENAI_API_KEY
python txt2img.py --folder generation_samples \
    --prompt PROMPT \
    --policy_ckpt_path POLICY_CKPT \
    --diff_ckpt_path DIFFUSION_CKPT \
    --cand_path CANDIDATE_PATH \
    --num_per_prompt 1
```
`PROMPT` denotes your prompt.
To train the policy network, first download the images from COCO2014 and the sampled training examples. Alternatively, you may sample examples yourself. In addition, download the weights of the aesthetic predictor pre-trained on the LAION dataset; a sketch of how such a predictor typically scores images appears after the training command below.
Run the training code:
```bash
export OPENAI_API_KEY=OPENAI_API_KEY
python -u train_rl.py \
    --gpu GPU_ID \
    --exp EXPERIMENT_NAME \
    --img_dir IMAGE_DIR \
    --sampled_data_dir DATA_PATH \
    --diff_ckpt_path DIFFUSION_CKPT \
    --aesthetic_ckpt AESTHETIC_CKPT
```
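The LAION aesthetic predictors in common use are small regression heads over CLIP image embeddings. Assuming that structure (the class, dimensions, and reward usage below are illustrative guesses, not the repository's exact training loop), the reward signal for the layout policy can be pictured as:

```python
import torch
import torch.nn as nn

class AestheticHead(nn.Module):
    """Assumed structure: a small MLP regressing an aesthetic score
    from a CLIP image embedding, as in the LAION predictor family."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, clip_emb: torch.Tensor) -> torch.Tensor:
        clip_emb = clip_emb / clip_emb.norm(dim=-1, keepdim=True)  # unit-normalize
        return self.mlp(clip_emb).squeeze(-1)

def policy_reward(clip_emb: torch.Tensor, head: AestheticHead) -> torch.Tensor:
    # Hypothetical: treat the aesthetic score of the generated image as
    # (part of) the reward for the layout policy; the paper's actual
    # reward may combine additional terms.
    with torch.no_grad():
        return head(clip_emb)

head = AestheticHead()
fake_emb = torch.randn(4, 768)  # stand-in for CLIP ViT-L/14 image embeddings
print(policy_reward(fake_emb, head).shape)  # torch.Size([4])
```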
```bibtex
@inproceedings{qu2023layoutllm,
  title={LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation},
  author={Qu, Leigang and Wu, Shengqiong and Fei, Hao and Nie, Liqiang and Chua, Tat-Seng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={643--654},
  year={2023}
}
```