Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. This repo aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models (VLMs): multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion) (Fig. 1).
Fig. 1: This work focuses on three main types of vision-language models.
This repo lists relevant papers summarized in our survey:
A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models. Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr. Preprint 2023. [pdf]
If you find our paper and repo helpful to your research, please cite the following paper:
@article{gu2023survey,
title={A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models},
author={Gu, Jindong and Han, Zhen and Chen, Shuo and Beirami, Ahmad and He, Bailan and Zhang, Gengyuan and Liao, Ruotong and Qin, Yao and Tresp, Volker and Torr, Philip},
journal={arXiv preprint arXiv:2307.12980},
year={2023}
}
Visual and textual modalities are integrated through one of two fusion-module designs: an encoder-decoder fusion module or a decoder-only fusion module. Prompting methods fall into two main categories (Fig. 2) according to the readability of their templates: hard prompts and soft prompts. Hard prompts comprise four subcategories: task instruction, in-context learning, retrieval-based prompting, and chain-of-thought prompting. Soft prompts follow two strategies, prompt tuning and prefix token tuning, distinguished by whether the new tokens are simply appended to the input (prompt tuning) or added internally to the model's architecture (prefix token tuning). This study primarily concentrates on prompting methods that avoid altering the base model.
Fig. 2: Classification of prompting methods.
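As a rough illustration of the four hard-prompt subcategories, the sketch below builds each kind of prompt as a plain string. It covers only the text side (image inputs are omitted), and every function name, template wording, and the word-overlap retriever are assumptions made for this example rather than anything prescribed by the survey.

```python
from typing import List, Tuple

Demo = Tuple[str, str]  # a solved (question, answer) pair


def task_instruction(question: str) -> str:
    """Task instruction: state the task in natural language before the query."""
    return f"Answer the question about the image.\nQuestion: {question}\nAnswer:"


def in_context(demos: List[Demo], question: str) -> str:
    """In-context learning: prepend solved demonstrations to the query."""
    shots = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in demos)
    return f"{shots}Question: {question}\nAnswer:"


def retrieval_based(pool: List[Demo], question: str, k: int = 1) -> str:
    """Retrieval-based prompting: choose the k demonstrations closest to the query.

    Word overlap is a toy stand-in for a real retriever (an assumption).
    """
    query_words = set(question.lower().split())

    def overlap(demo: Demo) -> int:
        return len(set(demo[0].lower().split()) & query_words)

    top = sorted(pool, key=overlap, reverse=True)[:k]
    return in_context(top, question)


def chain_of_thought(question: str) -> str:
    """Chain-of-thought prompting: ask for intermediate reasoning steps."""
    return f"Question: {question}\nAnswer: Let's think step by step."
```

All four builders leave the model untouched and change only the input text, which is what keeps hard prompts human-readable.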
Depending on the target of prompting, existing methods can be classified into three categories: prompting the text encoder, prompting the visual encoder, or jointly prompting both branches, as shown in Fig. 3. These approaches aim to enhance the flexibility and task-specific performance of VLMs.
Fig. 3: Classification of prompting methods on Image-Text Matching VLMs.
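To make "prompting the text encoder" concrete, here is a minimal sketch of hand-written text templates used for zero-shot classification with an image-text matching model. The encoder is passed in as a callable, so nothing below depends on a particular library. The helper names and the template-averaging step are assumptions for illustration, and a real setup would supply a CLIP-style text encoder and image embedding.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def class_embeddings(
    class_names: Sequence[str],
    templates: Sequence[str],
    encode_text: Callable[[str], Sequence[float]],
) -> Dict[str, Vector]:
    """Embed each class by filling every template and averaging the results."""
    result = {}
    for name in class_names:
        vecs = [normalize(encode_text(t.format(name))) for t in templates]
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        result[name] = normalize(mean)
    return result


def classify(image_embedding: Sequence[float], text_embeddings: Dict[str, Vector]) -> str:
    """Return the class whose prompt embedding has the highest cosine similarity."""
    img = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(img, vec))
        for name, vec in text_embeddings.items()
    }
    return max(scores, key=scores.get)
```

A typical call would pass templates such as "a photo of a {}." together with the model's own text encoder. Soft-prompt methods for the text branch replace the fixed template words with learnable vectors, which this sketch does not cover.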
Please contact us (jindong.gu@outlook.com, chenshuo.cs@outlook.com) if
- you would like to add your papers to this repo,
- you find any mistakes in this repo,
- you have any suggestions for this repo.