Common multimodal datasets

Image Datasets

Video & Language Datasets

Dataset	Paper	Clips	Captions	Videos	Duration	Source	Year	Tasks	Collection method
Charades	paper	10K	16K	10,000	82h	daily household videos	2016	action recognition & captioning	AMT
MSR-VTT	paper	10k	200k	7,180	40h	web-crawled videos from 257 queries	2016	retrieval & captioning	AMT
DiDeMo	paper	27k	41k	10,464	87h	over 14,000 videos randomly selected from YFCC100M	2017	moment localization	crowdsourcing
M-VAD	paper	49k	56k	92	84h	DVD movies	2015	retrieval	crowdsourcing
MPII-MD	paper	69k	68k	94	41h	Web Movies	2015	captioning	crowdsourcing
ActivityNet	paper	100k	100k	20,000	849h	online videos of human activities	2017	captioning & retrieval	AMT
TGIF	paper	69k	68k	94	41h	a year’s worth of GIF posts from Tumblr	2015	captioning	CrowdFlower
YouCook2	paper	14k	14k	2,000	176h	online cooking videos	2018	retrieval & captioning	well-trained native English speakers
LSMDC	paper	128k	128k	200	150h	combination of the M-VAD and MPII-MD datasets	2017	captioning	/
HowTo100M	paper	136M	136M	1.221M	134,472h	large-scale online videos	2019	action step localization & retrieval	ASR
Kinetics-700	paper	650K	/	650K	/	an extension of the Kinetics-600 dataset	2019	action recognition	/
AVA-Kinetics	paper	230K	/	230K	/	combines the annotation style of AVA with the Kinetics dataset	2020	action recognition	/
HACS	paper	1.5M	/	504K	/	large-scale human action localization dataset	2019	action recognition & captioning	crowdsourcing
TinyVIRAT	paper	13K	/	13K	/	low-resolution action recognition dataset (surveillance videos)	2020	action recognition	/
Action Genome	paper	234K	/	234K	/	video scene graphs	2020	action recognition & representations encoding event partonomies	crowdsourcing
SoccerNet	paper	650K	/	650K	764h	European football league videos	2018	event classification in football game videos	transformed from data on league websites
ActivityNet Entities	paper	650K	/	650K	/	grounds visual entities to objects in ActivityNet videos	2018	video understanding & action recognition	crowdsourcing
VidSitu	paper	136K	/	29K	/	events and related roles in movies	2021	semantic role and co-referencing prediction	AMT
VATEX	paper	41.3k	826k	41.3k	114h38m	human behavior videos from YouTube	2019	action recognition & captioning	/
MSVD	paper	2k	70k	2k	4h55m	web videos	2011	video captioning	AMT
MovieNet	paper	420k	25k	420k	/	Web Movies	2020	Genre classification & cinematic style analysis & character recognition & scene analysis & story understanding	crowdsourcing
MovieGraphs	paper	7.6k	70k	51	150h	scene graph representations of movies	2018	description retrieval & dialog retrieval & movie clip retrieval	crowdsourcing
QVHighlights	paper	10.3k	10.2k	10.3k	/	daily or travel vlogs and news	2021	moment retrieval & highlight detection	AMT
UCF101	paper	13.3k	/	13.3k	1600m	user-uploaded videos	2012	action recognition	crowdsourcing
HMDB51	paper	7K	/	7K	/	action videos from YouTube/Google	2011	action recognition & captioning	crowdsourcing
Moments-in-Time	paper	1M	/	1M	/	edited videos from YouTube, Flickr, Vine, Metacafe and other sources	2017	action & event recognition	AMT
AVA	paper	57.6K	300k	57.6K	/	Web Movies with human bounding boxes	2017	atomic visual action recognition	crowdsourcing
HVU	paper	57.2K	9M	57.2K	/	YouTube	2020	multi-label and multi-task video understanding	semi-automatic crowdsourcing strategy
Oops!	paper	20K	/	20K	/	in-the-wild videos of unintentional actions	2019	unintentional action recognition	AMT
CrossTask	paper	4.7K	/	4.7K	/	weakly supervised learning from instructional videos	2019	video classification	crowdsourcing
COIN	paper	11.8K	/	11.8K	/	comprehensive instructional video analysis	2019	step localization & action recognition	crowdsourcing
Sports-1M	paper	1.1M	/	1.1M	/	sports videos from YouTube	2014	video classification	crowdsourcing, labeled with a taxonomy
20BN-SOMETHING-SOMETHING	paper	220K	318K	220K	/	humans performing pre-defined basic actions with everyday objects	2017	action recognition	AMT
DALY	paper	8.1K	/	8.1K	/	Daily Action Localization in YouTube	2016	video classification	crowdsourcing
FineGym	paper	8.1K	/	8.1K	/	gymnastic videos with temporal actions and sub-actions	2020	video action recognition & detection & generation	crowdsourcing
MultiSports	paper	3.2K	/	3.2K	/	competition videos with high resolution held in recent years	2021	spatio-temporal action detection	/
“Wildlife Action”	paper	10.6K	/	10.6K	/	downloaded from YouTube	2020	animal action recognition	YouTube’s Data API
“Action Recognition of Large Animals”	paper	/	/	/	/	downloaded from YouTube	2018	animal action recognition	YouTube’s Data API
“First-Person Animal Action”	paper	/	/	/	/	collected by a dog wearing a GoPro-sized camera	2014	first-person animal activity recognition	/
AnimalWeb	paper	/	/	/	/	collected by a dog wearing a GoPro-sized camera	2014	first-person animal activity recognition	/
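
The Clips, Captions and Videos columns mix plain numbers (10,000) with shorthand such as 10K or 1.221M, which makes rows hard to sort or compare directly. The following is a minimal Python sketch of one way to normalize those cells, assuming K/k means thousand and uppercase M means million. `parse_count` is a hypothetical helper introduced here for illustration and is not part of any dataset toolkit.

```python
import re

# Assumed meaning of the size suffixes used in the table.
# Lowercase "m" is deliberately excluded so that values in minutes are not
# mistaken for millions.
_MULTIPLIERS = {"": 1, "k": 1_000, "K": 1_000, "M": 1_000_000}

# An integer or decimal number, optionally followed by one size suffix.
_COUNT_PATTERN = re.compile(r"(\d+(?:\.\d+)?)([kKM]?)")


def parse_count(cell):
    """Convert a cell such as '10K', '1.221M' or '10,000' to an int.

    Returns None for '/' and for anything that is not a plain count,
    for example a duration like '82h'.
    """
    text = cell.strip().replace(",", "")
    match = _COUNT_PATTERN.fullmatch(text)
    if match is None:
        return None
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[suffix])
```

With this helper, rows can be sorted by clip count by applying `parse_count` to the third tab-separated field of each row and skipping rows where it returns `None`.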

Video Datasets

Dataset	Videos	Duration	Source	Year
YouTube-8M	6M	350,000	YouTube	2018
FineAction	16,732	-	YouTube	24 May 2021
VideoLT	256,218	819,898	YouTube	6 May 2021

Dataset collection tools

Voxel
Amazon Mechanical Turk (AMT)
Shaip
