Image & Language (Retrieval, Captioning & Image Generation)

| No. | Model Name | Title | Links | Pub. | Organization | Release Time |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | VisualGPT | VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning | paper, code | arXiv 2021 | KAUST | 20 Feb 2021 |
| 2 | Kaleido-BERT | Kaleido-BERT: Vision-Language Pre-training on Fashion Domain | paper, code | CVPR 2021 | Alibaba Group | 15 April 2021 |
| 3 | CLIPBERT | Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling | paper, code | CVPR 2021 | UNC | 11 Feb 2021 |
| 4 | - | Probabilistic Embeddings for Cross-Modal Retrieval | paper, github | CVPR 2021 | NAVER Lab | 14 June 2021 |
| 5 | - | Scaling Up Vision-Language Representation Learning With Noisy Text Supervision | paper | ICML 2021 | Google | 11 June 2021 |
| 6 | - | Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training | paper | arXiv | MSRA | 28 June 2021 |
| 7 | CogView | CogView: Mastering Text-to-Image Generation via Transformers | paper, code | arXiv | Tsinghua University | 28 May 2021 |
| 8 | ViLT | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | paper, code | ICML 2021 | NAVER AI Lab | 10 Jun 2021 |
| 9 | - | Unifying Vision-and-Language Tasks via Text Generation | paper, code | ICML 2021 | UNC | 23 May 2021 |
| 10 | Pixel-BERT | Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers | paper | arXiv | University of Science and Technology Beijing | 22 Jun 2020 |
| 11 | - | How Much Can CLIP Benefit Vision-and-Language Tasks? | paper | arXiv | UCB | 13 Jul 2021 |
| 12 | LXMERT | LXMERT: Learning Cross-Modality Encoder Representations from Transformers | paper, code | EMNLP 2019 | UNC Chapel Hill | 3 Dec 2019 |
| 13 | ViLBERT | ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks | paper, code | NeurIPS 2019 | Georgia Institute of Technology | 6 Aug 2019 |
| 14 | ImageBERT | ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data | paper | arXiv | Bing, Microsoft | 23 Jan 2020 |
| 15 | Unicoder-VL | Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training | paper | AAAI 2020 | MSRA | 2 Dec 2019 |
| 16 | VLP | Unified Vision-Language Pre-Training for Image Captioning and VQA | paper, code | AAAI 2020 | University of Michigan | 4 Dec 2019 |
| 17 | XGPT | XGPT: Cross-modal Generative Pre-Training for Image Captioning | paper | arXiv | Peking University | 4 Mar 2020 |
| 18 | 12-IN-1 | 12-in-1: Multi-Task Vision and Language Representation Learning | paper, code | CVPR 2020 | Facebook | 5 Dec 2019 |
| 19 | FashionBERT | FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval | paper | SIGIR | Alibaba | 20 May 2020 |
| 20 | UNITER | UNITER: UNiversal Image-TExt Representation Learning | paper, code | ECCV 2020 | Microsoft Dynamics 365 AI Research | 25 Sep 2019 |
| 21 | VisDial-BERT | Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | paper, code | ECCV 2020 | Georgia Institute of Technology | 31 Mar 2020 |
| 22 | OSCAR | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | paper, code | ECCV 2020 | Microsoft | 13 Apr 2020 |
| 23 | KD-VLP | KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | paper | arXiv | ShanghaiTech | 22 Sep 2021 |
| 24 | Fast & Slow | Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers | paper | CVPR 2021 | DeepMind | 30 Mar 2021 |
| 25 | - | Unifying Multimodal Transformer for Bi-directional Image and Text Generation | paper | arXiv | Sun Yat-sen University | 19 Oct 2021 |
| 26 | SOHO | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | paper | CVPR 2021 | University of Science and Technology Beijing | 8 Apr 2021 |
| 27 | E2E-VLP | E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning | paper | ACL 2021 | Alibaba Group | 3 June 2021 |
| 28 | KD-VLP | KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation | paper | EMNLP 2021 | ShanghaiTech | 22 Sep 2021 |
| 29 | L-Verse | L-Verse: Bidirectional Generation Between Image and Text | paper | arXiv | LG AI Research | 22 Nov 2021 |
| 30 | NUWA | NUWA: Visual Synthesis Pre-training for Neural visUal World creAtion | paper | arXiv | MSRA | 24 Nov 2021 |
| 31 | Florence | Florence: A New Foundation Model for Computer Vision | paper | arXiv | Microsoft | 22 Nov 2021 |
| 32 | - | Distilled Dual-Encoder Model for Vision-Language Understanding | paper | arXiv | Microsoft | 16 Dec 2021 |
| 33 | FLAVA | FLAVA: A Foundational Language And Vision Alignment Model | paper | arXiv | FAIR | 8 Dec 2021 |

Object Detection

| No. | Model Name | Title | Links | Pub. | Organization | Release Time |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | MDETR | MDETR - Modulated Detection for End-to-End Multi-Modal Understanding | paper, code | ICCV 2021 | NYU | 26 April 2021 |
| 2 | pix2seq | pix2seq: A Language Modeling Framework for Object Detection | paper | arXiv | Google Research | 22 Sep 2021 |