GitHub - Vchitect/VBench: [CVPR2024 Highlight] VBench

This repository contains the implementation of the following paper and its related serial works in progress. We evaluate video generative models!

VBench: Comprehensive Benchmark Suite for Video Generative Models
Ziqi Huang^∗, Yinan He^∗, Jiashuo Yu^∗, Fan Zhang^∗, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin⁺, Yu Qiao⁺, Ziwei Liu⁺
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
Ziqi Huang^∗, Fan Zhang^∗, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin⁺, Yu Qiao⁺, Ziwei Liu⁺

🔥 Updates

[11/2024] VBench++ released.
[09/2024] VBench-Long Leaderboard available: Our VBench-Long leaderboard now has 10 long video generation models. VBench leaderboard now has 40 text-to-video (both long and short) models. All video generative models are encouraged to participate!
[09/2024] PyPI Updates: PyPI package is updated to version 0.1.4: bug fixes and multi-gpu inference.
[08/2024] Longer and More Descriptive Prompts: Available Here! We follow CogVideoX's prompt optimization technique to enhance VBench prompts using GPT-4o, making them longer and more descriptive without altering their original meaning.
[08/2024] VBench Leaderboard update: Our leaderboard has 28 T2V models, 12 I2V models so far. All video generative models are encouraged to participate!
[06/2024] 🔥 VBench-Long 🔥 is ready to use for evaluating longer Sora-like videos!
[06/2024] Model Info Documentation: Information on video generative models in our VBench Leaderboard is documented HERE.
[05/2024] PyPI Update: PyPI package vbench is updated to version 0.1.2. This includes changes in the preprocessing for high-resolution images/videos for imaging_quality, support for evaluating customized videos, and minor bug fixes.
[04/2024] We release all the videos we sampled and used for VBench evaluation. See details here.
[03/2024] 🔥 VBench-Trustworthiness 🔥 We now support evaluating the trustworthiness (e.g., culture, fairness, bias, safety) of video generative models.
[03/2024] 🔥 VBench-I2V 🔥 We now support evaluating Image-to-Video (I2V) models. We also provide Image Suite.
[03/2024] We support evaluating customized videos! See here for instructions.
[01/2024] PyPI package is released! Simply pip install vbench.
[12/2023] 🔥 VBench 🔥 Evaluation code released for 16 Text-to-Video (T2V) evaluation dimensions.
- ['subject_consistency', 'background_consistency', 'temporal_flickering', 'motion_smoothness', 'dynamic_degree', 'aesthetic_quality', 'imaging_quality', 'object_class', 'multiple_objects', 'human_action', 'color', 'spatial_relationship', 'scene', 'temporal_style', 'appearance_style', 'overall_consistency']
[11/2023] Prompt Suites released. (See prompt lists here)

📣 Overview

We propose VBench, a comprehensive benchmark suite for video generative models. We design a comprehensive and hierarchical Evaluation Dimension Suite to decompose "video generation quality" into multiple well-defined dimensions to facilitate fine-grained and objective evaluation. For each dimension and each content category, we carefully design a Prompt Suite as test cases, and sample Generated Videos from a set of video generation models. For each evaluation dimension, we specifically design an Evaluation Method Suite, which uses a carefully crafted method or designated pipeline for automatic objective evaluation. We also conduct Human Preference Annotation for the generated videos for each dimension, and show that VBench evaluation results are well aligned with human perceptions. VBench can provide valuable insights from multiple perspectives. VBench++ supports a wide range of video generation tasks, including text-to-video and image-to-video, with an adaptive Image Suite for fair evaluation across different settings. It evaluates not only technical quality but also the trustworthiness of generative models, offering a comprehensive view of model performance. We continually incorporate more video generative models into VBench to inform the community about the evolving landscape of video generation.

🎓 Evaluation Results

See our leaderboard for the most updated ranking and numerical results (with models like Gen-3, Kling, Pika).

We visualize VBench evaluation results of various publicly available video generation models, as well as Gen-2 and Pika, across 16 VBench dimensions. We normalize the results per dimension for clearer comparisons.

🏆 Leaderboard

See numeric values at our Leaderboard 🥇🥈🥉

How to join VBench Leaderboard? See the 3 options below:

| Sampling Party | Evaluation Party | Comments |
| --- | --- | --- |
| VBench Team | VBench Team | We periodically allocate resources to sample newly released models and perform evaluations. You can request us to perform sampling and evaluation, but the progress depends on our available resources. |
| Your Team | VBench Team | For non-open-source models interested in joining our leaderboard, submit your video samples to us for evaluation. If you prefer to provide the evaluation results directly, see the row below. |
| Your Team | Your Team | If you have already used VBench for full evaluation in your report/paper, submit your `eval_results.zip` files to the VBench Leaderboard using the `Submit here!` form. The evaluation results will be automatically updated to the leaderboard. Also, share your model information for our records for any columns here. |

📽️ Model Info

See model info for video generation models we used for evaluation.

Evaluation Criterion

For videos with a duration >= 5.0s, we use VBench-Long for evaluation.
For videos with a duration < 5.0s, we use VBench for evaluation.
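
The rule above can be sketched as a small helper. The function name `select_benchmark` and the constant `LONG_VIDEO_THRESHOLD_S` are illustrative and not part of the vbench package; how you obtain the duration (for example, frame count divided by frame rate) is left to your own tooling.

```python
# Hypothetical helper (not part of the vbench package): pick the benchmark
# variant from a video's duration, following the 5.0 s rule stated above.
LONG_VIDEO_THRESHOLD_S = 5.0


def select_benchmark(duration_s: float) -> str:
    """Return "VBench-Long" for videos of 5.0 s or more, otherwise "VBench"."""
    if duration_s < 0:
        raise ValueError("duration must be non-negative")
    return "VBench-Long" if duration_s >= LONG_VIDEO_THRESHOLD_S else "VBench"


if __name__ == "__main__":
    for d in (2.0, 5.0, 12.5):
        print(d, "->", select_benchmark(d))
```

Note that a video of exactly 5.0 s falls on the VBench-Long side of the boundary.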

🔨 Installation

Install with pip

pip install vbench

To evaluate some video generation ability aspects, you need to install detectron2 via:

pip install detectron2@git+https://github.com/facebookresearch/detectron2.git

If there is an error during detectron2 installation, see here.

Download VBench_full_info.json to your running directory to read the benchmark prompt suites.

Install with git clone

git clone https://github.com/Vchitect/VBench.git
pip install -r VBench/requirements.txt
pip install VBench

If there is an error during detectron2 installation, see here.

Usage

Use VBench to evaluate videos and video generative models.

A Side Note: VBench is designed for evaluating different models on a standard benchmark. Therefore, by default, we enforce evaluation on the standard VBench prompt lists to ensure fair comparisons among different video generation models. That's also why we give warnings when a required video is not found. This is done by defining the set of prompts in VBench_full_info.json. However, we understand that many users would like to use VBench to evaluate their own videos, or videos generated from prompts that do not belong to the VBench Prompt Suite, so we also added support for Evaluating Your Own Videos. Simply set mode=custom_input, and you can evaluate your own videos.

[New] Evaluate Your Own Videos

We support evaluating any video. Simply provide the path to the video file, or the path to the folder that contains your videos. There is no requirement on the videos' names.

Note: We support customized videos / prompts for the following dimensions: 'subject_consistency', 'background_consistency', 'motion_smoothness', 'dynamic_degree', 'aesthetic_quality', 'imaging_quality'

To evaluate videos with customized input prompt, run our script with --mode=custom_input:

python evaluate.py \
    --dimension $DIMENSION \
    --videos_path /path/to/folder_or_video/ \
    --mode=custom_input

Alternatively, you can use our command:

vbench evaluate \
    --dimension $DIMENSION \
    --videos_path /path/to/folder_or_video/ \
    --mode=custom_input

To evaluate using multiple GPUs, use the following commands:

torchrun --nproc_per_node=${GPUS} --standalone evaluate.py ...args...

or

vbench evaluate --ngpus=${GPUS} ...args...
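
When scripting many runs, it can help to assemble these commands programmatically. The sketch below only builds the argument list and does not run anything. `build_vbench_command` and `CUSTOM_INPUT_DIMENSIONS` are hypothetical names; the flags are the ones shown above, and the dimension check mirrors the note on which dimensions support customized videos.

```python
import shlex
from typing import List, Optional

# Dimensions the note above lists as supported in custom_input mode.
CUSTOM_INPUT_DIMENSIONS = {
    "subject_consistency", "background_consistency", "motion_smoothness",
    "dynamic_degree", "aesthetic_quality", "imaging_quality",
}


def build_vbench_command(dimension: str, videos_path: str,
                         ngpus: Optional[int] = None) -> List[str]:
    """Build a `vbench evaluate` argument list for custom_input mode."""
    if dimension not in CUSTOM_INPUT_DIMENSIONS:
        raise ValueError(f"{dimension!r} is not supported in custom_input mode")
    cmd = ["vbench", "evaluate",
           "--dimension", dimension,
           "--videos_path", videos_path,
           "--mode=custom_input"]
    if ngpus is not None:
        cmd.append(f"--ngpus={ngpus}")  # multi-GPU flag shown above
    return cmd


if __name__ == "__main__":
    cmd = build_vbench_command("aesthetic_quality", "/path/to/folder_or_video/", ngpus=2)
    # Print a copy-pasteable command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps paths with spaces intact if the command is later passed to `subprocess.run`.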

Evaluation on the Standard Prompt Suite of VBench

Command Line

vbench evaluate --videos_path $VIDEO_PATH --dimension $DIMENSION

For example:

vbench evaluate --videos_path "sampled_videos/lavie/human_action" --dimension "human_action"

Python

from vbench import VBench
my_VBench = VBench(device, <path/to/VBench_full_info.json>, <path/to/save/dir>)
my_VBench.evaluate(
    videos_path = <video_path>,
    name = <name>,
    dimension_list = [<dimension>, <dimension>, ...],
)

For example:

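import torch
from vbench import VBench

# Assumption: run on GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
my_VBench = VBench(device, "vbench/VBench_full_info.json", "evaluation_results")
my_VBench.evaluate(
    videos_path = "sampled_videos/lavie/human_action",
    name = "lavie_human_action",
    dimension_list = ["human_action"],
)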

Evaluation of Different Content Categories

command line

vbench evaluate \
    --videos_path $VIDEO_PATH \
    --dimension $DIMENSION \
    --mode=vbench_category \
    --category=$CATEGORY

or

python evaluate.py \
    --dimension $DIMENSION \
    --videos_path /path/to/folder_or_video/ \
    --mode=vbench_category

Example of Evaluating VideoCrafter-1.0

We have provided scripts to download VideoCrafter-1.0 samples, and the corresponding evaluation scripts.

# download sampled videos
sh scripts/download_videocrafter1.sh

# evaluate VideoCrafter-1.0
sh scripts/evaluate_videocrafter1.sh

Submit to Leaderboard

We have provided scripts for calculating the Total Score, Quality Score, and Semantic Score in the Leaderboard. You can run them locally to obtain the aggregate scores or as a final check before submitting to the Leaderboard.

# Pack the evaluation results into a zip file.
cd evaluation_results
zip -r ../evaluation_results.zip .

# [Optional] get the total score of your submission file.
python scripts/cal_final_score.py --zip_file {path_to_evaluation_results.zip} --model_name {your_model_name}

You can submit the json file to HuggingFace
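
If the `zip` command is not available on your system, the packing step can be done with Python's standard library instead. This is a minimal sketch: `pack_results` is a hypothetical helper name, and the default directory and archive names simply mirror the shell commands above.

```python
import shutil
from pathlib import Path


def pack_results(results_dir: str = "evaluation_results",
                 archive_base: str = "evaluation_results") -> Path:
    """Zip the contents of results_dir into <archive_base>.zip and return its path."""
    root = Path(results_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    # root_dir makes archive entries relative to results_dir,
    # matching `cd evaluation_results && zip -r ../evaluation_results.zip .`
    return Path(shutil.make_archive(archive_base, "zip", root_dir=root))
```

Calling `pack_results()` from the directory that contains `evaluation_results/` writes `evaluation_results.zip` next to it.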

How to Calculate Total Score

To calculate the Total Score, we follow these steps:

Normalization:
Each dimension's results are normalized using the following formula:
```
Normalized Score = (dim_score - min_val) / (max_val - min_val)
```
Quality Score:
The Quality Score is a weighted average of the following dimensions:
subject consistency, background consistency, temporal flickering, motion smoothness, aesthetic quality, imaging quality, and dynamic degree.
Semantic Score:
The Semantic Score is a weighted average of the following dimensions:
object class, multiple objects, human action, color, spatial relationship, scene, appearance style, temporal style, and overall consistency.
Weighted Average Calculation:
The Total Score is a weighted average of the Quality Score and Semantic Score:
```
Total Score = w1 * Quality Score + w2 * Semantic Score
```

The minimum and maximum values used for normalization in each dimension, as well as the weighting coefficients for the average calculation, can be found in the scripts/constant.py file.
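
The steps above can be sketched in a few lines of Python. All numbers in the example are made up for illustration; the actual per-dimension minimum and maximum values and the weights live in scripts/constant.py and are not reproduced here. The function names are hypothetical, and dividing by the sum of the weights is an assumption about how the weighted average is formed (when the weights sum to 1 it reduces to the formula shown above).

```python
from typing import Dict


def normalize(score: float, min_val: float, max_val: float) -> float:
    """Min-max normalization: (dim_score - min_val) / (max_val - min_val)."""
    return (score - min_val) / (max_val - min_val)


def weighted_average(values: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean over the keys of `values`, dividing by the total weight."""
    total_weight = sum(weights[k] for k in values)
    return sum(values[k] * weights[k] for k in values) / total_weight


if __name__ == "__main__":
    # Placeholder scores for two quality and two semantic dimensions only.
    raw = {"subject_consistency": 0.95, "imaging_quality": 0.60,
           "object_class": 0.80, "color": 0.70}
    bounds = {k: (0.0, 1.0) for k in raw}   # hypothetical (min_val, max_val)
    dim_w = {k: 1.0 for k in raw}           # hypothetical dimension weights
    norm = {k: normalize(v, *bounds[k]) for k, v in raw.items()}

    quality = weighted_average(
        {k: norm[k] for k in ("subject_consistency", "imaging_quality")}, dim_w)
    semantic = weighted_average(
        {k: norm[k] for k in ("object_class", "color")}, dim_w)
    # The 0.5 / 0.5 weights are placeholders, not the leaderboard's coefficients.
    total = weighted_average({"quality": quality, "semantic": semantic},
                             {"quality": 0.5, "semantic": 0.5})
    print(f"quality={quality:.3f} semantic={semantic:.3f} total={total:.3f}")
```

For leaderboard-comparable numbers, use scripts/cal_final_score.py rather than this sketch.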

TODO

Total Score Calculation for VBench-I2V

💎 Pre-Trained Models

[Optional] Please download the pre-trained weights according to the guidance in the model_path.txt file for each model in the pretrained folder to ~/.cache/vbench.

📑 Prompt Suite

We provide prompt lists at prompts/.

Check out details of prompt suites, and instructions for how to sample videos for evaluation.

📑 Sampled Videos

To facilitate future research and to ensure full transparency, we release all the videos we sampled and used for VBench evaluation. You can download them on Google Drive.

See detailed explanations of the sampled videos here.

We also provide detailed settings for the models under evaluation here.

🏄 Evaluation Method Suite

To perform evaluation on one dimension, run this:

python evaluate.py --videos_path $VIDEOS_PATH --dimension $DIMENSION

The complete list of dimensions:

['subject_consistency', 'background_consistency', 'temporal_flickering', 'motion_smoothness', 'dynamic_degree', 'aesthetic_quality', 'imaging_quality', 'object_class', 'multiple_objects', 'human_action', 'color', 'spatial_relationship', 'scene', 'temporal_style', 'appearance_style', 'overall_consistency']

Alternatively, you can evaluate multiple models and multiple dimensions using this script:

bash evaluate.sh

The default sampled video paths:

vbench_videos/{model}/{dimension}/{prompt}-{index}.mp4/gif
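
As an illustration of this layout, the sketch below builds a path from its parts and recovers the prompt and index from a file name. Both function names are hypothetical helpers, not part of the repository, and the parsing assumes the index is the text after the last hyphen in the file name.

```python
from pathlib import PurePosixPath
from typing import Tuple


def sampled_video_path(model: str, dimension: str, prompt: str, index: int,
                       ext: str = "mp4", root: str = "vbench_videos") -> PurePosixPath:
    """Build {root}/{model}/{dimension}/{prompt}-{index}.{ext}."""
    if ext not in ("mp4", "gif"):
        raise ValueError("ext must be 'mp4' or 'gif'")
    return PurePosixPath(root) / model / dimension / f"{prompt}-{index}.{ext}"


def split_prompt_and_index(filename: str) -> Tuple[str, int]:
    """Recover (prompt, index) from a '{prompt}-{index}.{ext}' file name."""
    stem = PurePosixPath(filename).stem
    prompt, _, index = stem.rpartition("-")  # split on the last hyphen only
    return prompt, int(index)


if __name__ == "__main__":
    path = sampled_video_path("lavie", "human_action", "A person is running", 0)
    print(path)
    print(split_prompt_and_index(path.name))
```

Splitting on the last hyphen keeps prompts that themselves contain hyphens intact.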

Before evaluating the temporal flickering dimension, it is necessary to filter out the static videos first.

To filter static videos in the temporal flickering dimension, run this:

# This only filters out static videos whose prompts match those in the temporal_flickering dimension.
python static_filter.py --videos_path $VIDEOS_PATH

You can adjust the filtering scope by:

# 1. Change the filtering scope to consider all files inside videos_path for filtering.
python static_filter.py --videos_path $VIDEOS_PATH --filter_scope all

# 2. Specify the path to a JSON file ($filename) to consider only videos whose prompts match those listed in $filename.
python static_filter.py --videos_path $VIDEOS_PATH --filter_scope $filename

✒️ Citation

If you find our repo useful for your research, please consider citing our paper:

 @InProceedings{huang2023vbench,
     title={{VBench}: Comprehensive Benchmark Suite for Video Generative Models},
     author={Huang, Ziqi and He, Yinan and Yu, Jiashuo and Zhang, Fan and Si, Chenyang and Jiang, Yuming and Zhang, Yuanhan and Wu, Tianxing and Jin, Qingyang and Chanpaisit, Nattapol and Wang, Yaohui and Chen, Xinyuan and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
     booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
     year={2024}
 }

 @article{huang2024vbench++,
     title={VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models},
     author={Huang, Ziqi and Zhang, Fan and Xu, Xiaojie and He, Yinan and Yu, Jiashuo and Dong, Ziyue and Ma, Qianli and Chanpaisit, Nattapol and Si, Chenyang and Jiang, Yuming and Wang, Yaohui and Chen, Xinyuan and Chen, Ying-Cong and Wang, Limin and Lin, Dahua and Qiao, Yu and Liu, Ziwei},
     journal={arXiv preprint arXiv:2411.13503},
     year={2024}
 }

♥️ Acknowledgement

💪 VBench Contributors

Order is based on the time joining the project:

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Nattapol Chanpaisit, Xiaojie Xu, Qianli Ma, Ziyue Dong.

🤗 Open-Sourced Repositories

This project wouldn't be possible without the following open-sourced repositories: AMT, UMT, RAM, CLIP, RAFT, GRiT, IQA-PyTorch, ViCLIP, and LAION Aesthetic Predictor.
