git clone https://github.com/showlab/UniVTG
cd UniVTG
conda create --name univtg python=3.8
conda activate univtg
pip install -r requirements.txt
One engineering contribution is that we use the same features across most video temporal tasks, which makes pre-training and cross-dataset training flexible.
- Download the features and metadata for the pretraining and downstream datasets. (Skip the pretraining datasets if you do not need them.)
Dataset | Task | Metadata | Video (Slowfast R50) | Video (CLIP B/32) | Text (CLIP B/32) |
---|---|---|---|---|---|
Point (Ego4D) | PT | 548 MB | 27.1 GB | 5.7 GB | 30.7 GB |
Interval (VideoCC) | PT | 155 MB | 300 GB | 62.5 GB | 12.6 GB |
Curve (VideoCC) | PT | 3.8 GB | 👆 | 👆 | 132 MB |
QVHighlights | MR + HL | 5 MB | 4.0 GB | 940 MB | 172 MB |
Charades-STA | MR | 4 MB | 1.3 GB | 305 MB | 178 MB |
NLQ | MR | 3 MB | 1.8 GB | 404 MB | 184 MB |
TACoS | MR | 2 MB | 81 MB | 18 MB | 244 MB |
YoutubeHL | HL | 1 MB | 427 MB | 95 MB | 2 MB |
TVSum | HL | 1 MB | 28 MB | 6 MB | 1 MB |
QFVS | VS | 1 MB | 455 MB | 👈 | 1 MB |
ActivityNet (optional) | MR | 10 MB | 4.5 GB | 1.0 GB | 958 MB |
DiDeMo (optional) | MR | 6 MB | 1.1 GB | 269 MB | 443 MB |
HACS (optional) | MR | 15 MB | 13.1 GB | 3.0 GB | 177 MB |
COIN (optional) | MR | 8 MB | 2.3 GB | 556 MB | 30 MB |
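Before downloading, it can help to estimate how much disk space a chosen subset of datasets needs. The sketch below is only an illustration: it copies a few rows from the table above, treats 1 GB as 1000 MB (an assumption), and uses a hypothetical helper name, `total_download_mb`, that is not part of this repo.

```python
# Sizes in MB, copied from the table above (1 GB is taken as 1000 MB, an assumption).
# Only a few downstream rows are included; extend the dict for other datasets.
SIZES_MB = {
    "QVHighlights": {"metadata": 5, "vid_slowfast": 4000, "vid_clip": 940, "txt_clip": 172},
    "Charades-STA": {"metadata": 4, "vid_slowfast": 1300, "vid_clip": 305, "txt_clip": 178},
    "NLQ": {"metadata": 3, "vid_slowfast": 1800, "vid_clip": 404, "txt_clip": 184},
    "TACoS": {"metadata": 2, "vid_slowfast": 81, "vid_clip": 18, "txt_clip": 244},
    "YoutubeHL": {"metadata": 1, "vid_slowfast": 427, "vid_clip": 95, "txt_clip": 2},
    "TVSum": {"metadata": 1, "vid_slowfast": 28, "vid_clip": 6, "txt_clip": 1},
}


def total_download_mb(datasets, sizes=SIZES_MB):
    """Sum the listed archive sizes (in MB) for the given dataset names."""
    return sum(sum(sizes[name].values()) for name in datasets)


if __name__ == "__main__":
    wanted = ["QVHighlights", "Charades-STA"]
    print(f"{wanted}: about {total_download_mb(wanted) / 1000:.1f} GB to download")
```

The extracted files sit next to the tar archives until you delete them, so the space needed during setup is likely higher than this figure.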
- Extract each downloaded tar archive with
tar -xvf {tar_name}.tar
mv data/home/qinghonglin/univtg/data/{dset_name}/* . # Replace dset_name accordingly
For the VideoCC Slowfast features, first decompress the split parts and concatenate them into a single tar archive, then extract it.
gunzip vid_slowfast_*.gz
cat vid_slowfast_* > vid_slowfast.tar
tar -xvf vid_slowfast.tar
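If you prefer to script this step, the same merge can be sketched in Python with the standard library only. This is a minimal sketch, not part of the repo: the function name `merge_and_extract` is hypothetical, and it assumes the parts are named `vid_slowfast_*.gz` and that their sorted filename order is the order in which they were split.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def merge_and_extract(parts_dir, prefix="vid_slowfast", out_dir=None):
    """Decompress split .gz parts, concatenate them into one tar, and extract it."""
    parts_dir = Path(parts_dir)
    out_dir = Path(out_dir) if out_dir else parts_dir

    # Assumption: lexicographic filename order matches the original split order.
    parts = sorted(parts_dir.glob(f"{prefix}_*.gz"))
    if not parts:
        raise FileNotFoundError(f"no {prefix}_*.gz parts found in {parts_dir}")

    # Stream each decompressed part into a single tar file.
    merged = parts_dir / f"{prefix}.tar"
    with open(merged, "wb") as dst:
        for part in parts:
            with gzip.open(part, "rb") as src:
                shutil.copyfileobj(src, dst)

    # Extract the merged archive; only do this for archives you trust.
    with tarfile.open(merged) as tar:
        tar.extractall(out_dir)
    return merged
```

Unlike `gunzip`, this sketch leaves the original `.gz` parts in place, so you can delete them once the extracted features look correct.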
- Organize the data / features in the following structure
univtg
├── eval
├── data
│   ├── qfvs
│   ├── tvsum
│   ├── youtube
│   ├── tacos
│   ├── ego4d
│   ├── charades
│   │   ├── metadata
│   │   │   ├──charades_test.jsonl
│   │   │   └──charades_train.jsonl
│   │   ├── txt_clip
│   │   ├── vid_clip
│   │   └── vid_slowfast
│   └── qvhighlights
│       ├── metadata
│       │   ├──qvhighlights_test.jsonl
│       │   ├──qvhighlights_train.jsonl
│       │   └──qvhighlights_val.jsonl
│       ├── txt_clip
│       ├── vid_clip
│       └── vid_slowfast
├── main
├── model
├── utils
├── README.md
└── ···
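A quick way to confirm the layout before training is to check that each dataset folder contains the expected subdirectories. The sketch below is illustrative only: `missing_dirs` is a hypothetical helper, and it assumes every dataset uses the same four subdirectories shown for `charades` and `qvhighlights` in the tree above, which may not hold for all of them (for example, the table lists no separate CLIP video download for QFVS).

```python
from pathlib import Path

# Subdirectories shown for charades and qvhighlights in the tree above.
# Assumption: the other datasets follow the same layout.
EXPECTED_SUBDIRS = ("metadata", "txt_clip", "vid_clip", "vid_slowfast")


def missing_dirs(data_root, datasets):
    """Return the expected '<dataset>/<subdir>' paths that are absent under data_root."""
    root = Path(data_root)
    return [
        f"{name}/{sub}"
        for name in datasets
        for sub in EXPECTED_SUBDIRS
        if not (root / name / sub).is_dir()
    ]


if __name__ == "__main__":
    # Run from the repository root, where the data folder lives.
    for path in missing_dirs("data", ["charades", "qvhighlights"]):
        print(f"missing: data/{path}")
```

An empty result means every listed dataset has all four folders; it does not check that the feature files inside them are complete.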
- (Optional) We extract video features (Slowfast R50 and CLIP B/32) with the HERO_Video_Feature_Extractor repo, which you can also use to extract features for other benchmarks or videos. We extract text features (CLIP B/32) with run_on_video/text_extractor.py