| No. | Model | Title | Links | Venue | Institution | Date |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | VisualGPT | VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning | paper code | arXiv 2021 | KAUST | 20 Feb 2021 |
| 2 | Kaleido-BERT | Kaleido-BERT: Vision-Language Pre-training on Fashion Domain | paper code | CVPR 2021 | Alibaba Group | 15 Apr 2021 |
| 3 | CLIPBERT | Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | paper code | CVPR 2021 | UNC | 11 Feb 2021 |
| 4 | - | Probabilistic Embeddings for Cross-Modal Retrieval | paper github | CVPR 2021 | NAVER Lab | 14 Jun 2021 |
| 5 | - | Scaling Up Vision-Language Representation Learning With Noisy Text Supervision | paper | ICML 2021 | Google | 11 Jun 2021 |
| 6 | - | Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | paper | arXiv | MSRA | 28 Jun 2021 |
| 7 | CogView | CogView: Mastering Text-to-Image Generation via Transformers | paper code | arXiv | Tsinghua University | 28 May 2021 |
| 8 | ViLT | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | paper code | ICML 2021 | NAVER AI Lab | 10 Jun 2021 |
| 9 | - | Unifying Vision-and-Language Tasks via Text Generation | paper code | ICML 2021 | UNC | 23 May 2021 |
| 10 | Pixel-BERT | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | paper | arXiv | University of Science and Technology Beijing | 22 Jun 2020 |
| 11 | - | How Much Can CLIP Benefit Vision-and-Language Tasks? | paper | arXiv | UCB | 13 Jul 2021 |
| 12 | LXMERT | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | paper code | EMNLP 2019 | UNC Chapel Hill | 3 Dec 2019 |
| 13 | ViLBERT | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | paper code | NeurIPS 2019 | Georgia Institute of Technology | 6 Aug 2019 |
| 14 | ImageBERT | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | paper | arXiv | Bing, Microsoft | 23 Jan 2020 |
| 15 | Unicoder-VL | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | paper | AAAI 2020 | MSRA | 2 Dec 2019 |
| 16 | VLP | Unified Vision-Language Pre-Training for Image Captioning and VQA | paper code | AAAI 2020 | University of Michigan | 4 Dec 2019 |
| 17 | XGPT | XGPT: Cross-modal Generative Pre-Training for Image Captioning | paper | arXiv | Peking University | 4 Mar 2020 |
| 18 | 12-IN-1 | 12-in-1: Multi-Task Vision and Language Representation Learning | paper code | CVPR 2020 | Facebook | 5 Dec 2019 |
| 19 | FashionBERT | FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval | paper | SIGIR 2020 | Alibaba | 20 May 2020 |
| 20 | UNITER | UNITER: UNiversal Image-TExt Representation Learning | paper code | ECCV 2020 | Microsoft Dynamics 365 AI Research | 25 Sep 2019 |
| 21 | VisDial-BERT | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | paper code | ECCV 2020 | Georgia Institute of Technology | 31 Mar 2020 |
| 22 | OSCAR | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | paper code | ECCV 2020 | Microsoft | 13 Apr 2020 |
| 23 | KD-VLP | KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | paper | EMNLP 2021 | ShanghaiTech | 22 Sep 2021 |
| 24 | Fast & Slow | Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers | paper | CVPR 2021 | DeepMind | 30 Mar 2021 |
| 25 | - | Unifying Multimodal Transformer for Bi-directional Image and Text Generation | paper | arXiv | Sun Yat-sen University | 19 Oct 2021 |
| 26 | SOHO | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | paper | CVPR 2021 | University of Science and Technology Beijing | 8 Apr 2021 |
| 27 | E2E-VLP | E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning | paper | ACL 2021 | Alibaba Group | 3 Jun 2021 |
| 28 | L-Verse | L-Verse: Bidirectional Generation Between Image and Text | paper | arXiv | LG AI Research | 22 Nov 2021 |
| 29 | NUWA | NUWA: Visual Synthesis Pre-training for Neural visUal World creAtion | paper | arXiv | MSRA | 24 Nov 2021 |
| 30 | Florence | Florence: A New Foundation Model for Computer Vision | paper | arXiv | Microsoft | 22 Nov 2021 |
| 31 | - | Distilled Dual-Encoder Model for Vision-Language Understanding | paper | arXiv | Microsoft | 16 Dec 2021 |
| 32 | FLAVA | FLAVA: A Foundational Language And Vision Alignment Model | paper | arXiv | FAIR | 8 Dec 2021 |