Charades |
paper |
10K |
16K |
10,000 |
82h |
daily household videos |
2016 |
action recognition & captioning |
AMT |
MSRVTT |
paper |
10k |
200k |
7,180 |
40h |
web-crawled videos with 257 queries |
2016 |
retrieval and captioning |
AMT |
DiDeMo |
paper |
27k |
41k |
10,464 |
87h |
over 14,000 videos randomly selected from YFCC100M |
2017 |
Moment localization |
crowdsourcing |
M-VAD |
paper |
49k |
56k |
92 |
84h |
DVD movies |
2015 |
retrieval |
crowdsourcing |
MPII-MD |
paper |
69k |
68k |
94 |
41h |
Web Movies |
2015 |
captioning |
crowdsourcing |
ActivityNet |
paper |
100k |
100k |
20,000 |
849h |
online human activity videos |
2017 |
captioning & retrieval |
AMT |
TGIF |
paper |
69k |
68k |
94 |
41h |
a year’s worth of GIF posts from Tumblr |
2015 |
captioning |
CrowdFlower |
YouCook2 |
paper |
14k |
14k |
2,000 |
176h |
online cooking videos |
2018 |
retrieval & captioning |
well-trained native English speakers |
LSMDC |
paper |
128k |
128k |
200 |
150h |
combination of M-VAD and MPII-MD datasets |
2017 |
captioning |
/ |
HowTo100M |
paper |
136M |
136M |
1.221M |
134,472h |
large-scale online videos |
2019 |
action step localization & retrieval |
ASR |
Kinetics-700 |
paper |
650K |
/ |
650K |
/ |
an extension of the Kinetics-600 dataset |
2019 |
action recognition |
/ |
AVA-Kinetics |
paper |
230K |
/ |
230K |
/ |
combines the annotation style of AVA with the Kinetics dataset |
2020 |
action recognition |
/ |
HACS |
paper |
1.5M |
/ |
504K |
/ |
large scale human action localization dataset |
2019 |
action recognition & captioning |
crowdsourcing |
TinyVIRAT |
paper |
13K |
/ |
13K |
/ |
low-resolution action recognition dataset (surveillance videos) |
2020 |
action recognition |
/ |
Action Genome |
paper |
234K |
/ |
234K |
/ |
video scene graph |
2020 |
action recognition & representations encoding event partonomies |
crowdsourcing |
SoccerNet |
paper |
650K |
764h |
650K |
/ |
European Football League video |
2018 |
event classification in football game video |
derived from data on league websites |
ActivityNet Entities |
paper |
650K |
/ |
650K |
/ |
grounds visual entities in ActivityNet video objects |
2018 |
video understanding & action recognition |
crowdsourcing |
VidSitu |
paper |
136K |
/ |
29K |
/ |
events and related roles in movies |
2021 |
semantic role and co-referencing prediction |
AMT |
VATEX |
paper |
41.3k |
826k |
41.3k |
114h38m |
human behavior videos from YouTube |
2019 |
action recognition & captioning |
/ |
MSVD |
paper |
2k |
70k |
2k |
4h55m |
web videos |
2011 |
video captioning |
AMT |
MovieNet |
paper |
420k |
25k |
420k |
/ |
Web Movies |
2020 |
Genre classification & cinematic style analysis & character recognition & scene analysis & story understanding |
crowdsourcing |
MovieGraphs |
paper |
7.6k |
70k |
51 |
150h |
scene graph representation of movie |
2018 |
description retrieval & dialog retrieval & movie clip retrieval |
crowdsourcing |
QVHIGHLIGHTS |
paper |
10.3k |
10.2k |
10.3k |
/ |
daily or travel vlog and news |
2021 |
moment retrieval & highlight detection |
AMT |
UCF101 |
paper |
13.3k |
1600m |
13.3k |
/ |
user-uploaded videos |
2012 |
action recognition |
crowdsourcing |
HMDB51 |
paper |
7K |
/ |
7K |
/ |
action videos from YouTube/Google |
2011 |
action recognition & captioning |
crowdsourcing |
Moments-in-Time |
paper |
1M |
/ |
1M |
/ |
edited videos from YouTube, Flickr, Vine, Metacafe and other sources |
2017 |
action & event recognition |
AMT |
AVA |
paper |
57.6K |
300k |
57.6K |
/ |
Web Movies with human bounding boxes |
2017 |
atomic visual action recognition |
crowdsourcing |
HVU |
paper |
57.2K |
9M |
57.2K |
/ |
YouTube |
2020 |
multi-label and multi-task video understanding |
semi-automatic crowdsourcing strategy |
Oops! |
paper |
20K |
/ |
20K |
/ |
in-the-wild videos of unintentional action |
2019 |
unintentional action recognition |
AMT |
CrossTask |
paper |
4.7K |
/ |
4.7K |
/ |
weakly supervised learning from instructional videos |
2019 |
video classification |
crowdsourcing |
COIN |
paper |
11.8K |
/ |
11.8K |
/ |
Comprehensive instructional video analysis |
2019 |
step localization & action recognition |
crowdsourcing |
Sports-1M |
paper |
1.1M |
/ |
1.1M |
/ |
sports videos from YouTube |
2014 |
video classification |
crowdsourcing, labeled with a taxonomy |
20BN-SOMETHING-SOMETHING |
paper |
220K |
318K |
220K |
/ |
videos of humans performing pre-defined basic actions with everyday objects |
2017 |
action recognition |
AMT |
DALY |
paper |
8.1K |
/ |
8.1K |
/ |
Daily Action Localization in YouTube |
2016 |
video classification |
crowdsourcing |
FineGym |
paper |
8.1K |
/ |
8.1K |
/ |
gymnastic videos with temporal actions and sub-actions |
2020 |
video action recognition & detection & generation |
crowdsourcing |
MultiSports |
paper |
3.2K |
/ |
3.2K |
/ |
high-resolution videos of competitions held in recent years |
2021 |
spatio-temporal action detection |
/ |
“Wildlife Action” |
paper |
10.6K |
/ |
10.6K |
/ |
downloaded from YouTube |
2020 |
animal action recognition |
YouTube’s Data API |
“Action Recognition of Large Animals” |
paper |
/ |
/ |
/ |
/ |
downloaded from YouTube |
2018 |
animal action recognition |
YouTube’s Data API |
“First-Person Animal Action” |
paper |
/ |
/ |
/ |
/ |
collected by a dog wearing a GoPro size camera |
2014 |
first-person animal activity recognition |
/ |
AnimalWeb |
paper |
/ |
/ |
/ |
/ |
collected by a dog wearing a GoPro size camera |
2014 |
first-person animal activity recognition |
/ |