Image & Language (Retrieval, captioning & image generation)

No.	Model Name	Title	Links	Pub.	Organization	Release Time
1	VisualGPT	VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning	paper code	arXiv 2021	KAUST	20 Feb 2021
2	Kaleido-BERT	Kaleido-BERT: Vision-Language Pre-training on Fashion Domain	paper code	CVPR 2021	Alibaba Group	15 April 2021
3	CLIPBERT	Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling	paper code	CVPR 2021	UNC	11 Feb 2021
4	-	Probabilistic Embeddings for Cross-Modal Retrieval	paper github	CVPR 2021	NAVER Lab	14 June 2021
5	-	Scaling Up Vision-Language Representation Learning With Noisy Text Supervision	paper	ICML 2021	Google	11 June 2021
6	-	Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training	paper	arXiv	MSRA	28 June 2021
7	CogView	CogView: Mastering Text-to-Image Generation via Transformers	paper code	arXiv	Tsinghua University	28 May 2021
8	ViLT	ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision	paper code	ICML 2021	NAVER AI Lab	10 Jun 2021
9	-	Unifying Vision-and-Language Tasks via Text Generation	paper code	ICML 2021	UNC	23 May 2021
10	Pixel-BERT	Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers	paper	arXiv	University of Science and Technology Beijing	22 Jun 2020
11	-	How Much Can CLIP Benefit Vision-and-Language Tasks?	paper	arXiv	UCB	13 Jul 2021
12	LXMERT	LXMERT: Learning Cross-Modality Encoder Representations from Transformers	paper code	EMNLP 2019	UNC Chapel Hill	3 Dec 2019
13	ViLBERT	ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks	paper code	NeurIPS 2019	Georgia Institute of Technology	6 Aug 2019
14	ImageBERT	ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data	paper	arXiv	Bing, Microsoft	23 Jan 2020
15	Unicoder-VL	Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training	paper	AAAI 2020	MSRA	2 Dec 2019
16	VLP	Unified Vision-Language Pre-Training for Image Captioning and VQA	paper code	AAAI 2020	University of Michigan	4 Dec 2019
17	XGPT	XGPT: Cross-modal Generative Pre-Training for Image Captioning	paper	arXiv	Peking University	4 Mar 2020
18	12-IN-1	12-in-1: Multi-Task Vision and Language Representation Learning	paper code	CVPR 2020	Facebook	5 Dec 2019
19	FashionBERT	FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval	paper	SIGIR	Alibaba	20 May 2020
20	UNITER	UNITER: UNiversal Image-TExt Representation Learning	paper code	ECCV 2020	Microsoft Dynamics 365 AI Research	25 Sep 2019
21	VisDial-BERT	Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline	paper code	ECCV 2020	Georgia Institute of Technology	31 Mar 2020
22	OSCAR	Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks	paper code	ECCV 2020	Microsoft	13 Apr 2020
23	KD-VLP	KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation	paper	arXiv	ShanghaiTech	22 Sep 2021
24	Fast & Slow	Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers	paper	CVPR 2021	DeepMind	30 Mar 2021
25	-	Unifying Multimodal Transformer for Bi-directional Image and Text Generation	paper	arXiv	Sun Yat-sen University	19 Oct 2021
26	SOHO	Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning	paper	CVPR 2021	University of Science and Technology Beijing	8 Apr 2021
27	E2E-VLP	E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning	paper	ACL 2021	Alibaba Group	3 June 2021
28	KD-VLP	KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation	paper	EMNLP 2021	ShanghaiTech	22 Sep 2021
29	L-Verse	L-Verse: Bidirectional Generation Between Image and Text	paper	arXiv	LG AI Research	22 Nov 2021
30	NUWA	NUWA: Visual Synthesis Pre-training for Neural visUal World creAtion	paper	arXiv	MSRA	24 Nov 2021
31	Florence	Florence: A New Foundation Model for Computer Vision	paper	arXiv	Microsoft	22 Nov 2021
32	-	Distilled Dual-Encoder Model for Vision-Language Understanding	paper	arXiv	Microsoft	16 Dec 2021
33	FLAVA	FLAVA: A Foundational Language And Vision Alignment Model	paper	arXiv	FAIR	8 Dec 2021

Object Detection

No.	Model Name	Title	Links	Pub.	Organization	Release Time
1	MDETR	MDETR - Modulated Detection for End-to-End Multi-Modal Understanding	paper code	ICCV 2021	NYU	26 April 2021
2	pix2seq	pix2seq: A Language Modeling Framework for Object Detection	paper	arXiv	Google Research	22 Sep 2021
