VisionReward

📃 Paper • 🖼 Dataset (Coming soon) • 🤗 HF Repo • 🌐 Chinese Blog

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

VisionReward is a fine-grained, multi-dimensional reward model. We decompose human preferences for images and videos into multiple dimensions, each represented by a series of judgment questions; the answers are linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance in video preference prediction.

Quick Start

Set Up the Environment

Run the following command to prepare the environment:

pip install -r requirements.txt

Download the model

You can download the pre-trained VisionReward models for images and videos from the following Hugging Face repositories:

Image Reward Model: https://huggingface.co/THUDM/VisionReward-Image
Video Reward Model: https://huggingface.co/THUDM/VisionReward-Video
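As one possible way to fetch the checkpoints programmatically, the sketch below uses `snapshot_download` from the `huggingface_hub` package. The `checkpoints` target directory and the helper names are assumptions made for illustration; this README does not state where the inference scripts expect the model files, so adjust the path to match your setup.

```python
from pathlib import Path

# Repository IDs taken from the links above.
REPO_IDS = {
    "image": "THUDM/VisionReward-Image",
    "video": "THUDM/VisionReward-Video",
}


def local_dir_for(kind: str, target_root: str = "checkpoints") -> Path:
    """Return the (assumed) local directory for the given model kind."""
    repo_id = REPO_IDS[kind]
    return Path(target_root) / repo_id.split("/")[-1]


def download_checkpoint(kind: str, target_root: str = "checkpoints") -> Path:
    """Download the 'image' or 'video' checkpoint and return its local path."""
    # Imported lazily so the helpers above work without the package installed.
    # Requires: pip install huggingface_hub
    from huggingface_hub import snapshot_download

    local_dir = local_dir_for(kind, target_root)
    snapshot_download(repo_id=REPO_IDS[kind], local_dir=str(local_dir))
    return local_dir
```

Calling `download_checkpoint("image")` would place the files under `checkpoints/VisionReward-Image`, which keeps them separate from the `VisionReward-Image` folder that ships with this repository.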

VQA Example

Use the following commands to perform a checklist query. The available questions for images and videos are listed in VisionReward-Image/VisionReward_image_qa.txt and VisionReward-Video/VisionReward_video_qa.txt respectively.

python inference-image.py --bf16 --question [[your_question]]
# input: image_path + prompt + question
# output: yes/no

python inference-video.py --question [[your_question]]
# input: video_path + prompt + question
# output: yes/no

Using the model for scoring

Use the following commands to score images or videos. The corresponding weights are in VisionReward-Image/weight.json and VisionReward-Video/weight.json.

python inference-image.py --bf16 --score 
# input: image_path + prompt
# output: score

python inference-video.py --score
# input: video_path + prompt
# output: score
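To illustrate how yes/no checklist answers can be combined into a single score, here is a minimal sketch of a linear weighted sum. It is not the code used by the inference scripts: the function name `weighted_score`, the encoding of "yes" as 1 and "no" as 0, and the example weights are all assumptions, and the actual layout of `weight.json` is not described in this README.

```python
from typing import Sequence


def weighted_score(answers: Sequence[bool], weights: Sequence[float]) -> float:
    """Linearly combine yes/no answers into one score.

    Assumption: a "yes" answer contributes its weight and a "no" answer
    contributes nothing (yes = 1, no = 0). The real encoding may differ.
    """
    if len(answers) != len(weights):
        raise ValueError("answers and weights must have the same length")
    return sum(w for a, w in zip(answers, weights) if a)


# Hypothetical answers to three checklist questions and made-up weights.
example_answers = [True, False, True]
example_weights = [0.5, 0.25, -0.125]
print(weighted_score(example_answers, example_weights))  # 0.375
```

Because each question carries its own weight, the contribution of every answer to the final score can be read off directly, which is what makes the score interpretable.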

Using the model for comparing two videos

Use the following command to compare two videos. The corresponding weights are in VisionReward-Video/weight.json.

python inference-video.py --compare
# input: video_path1 + video_path2 + prompt
# output: better_video
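Conceptually, a comparison can be reduced to scoring both videos against the same prompt and picking the higher score. The sketch below shows that idea under stated assumptions: `pick_better` is a hypothetical helper, the scores are assumed to have been computed already, and breaking ties in favor of the first video is an arbitrary choice rather than documented behavior of `inference-video.py`.

```python
def pick_better(path1: str, score1: float, path2: str, score2: float) -> str:
    """Return the path of the video with the higher score.

    Assumption: ties go to the first video; the actual script may
    handle ties differently.
    """
    return path1 if score1 >= score2 else path2


# Hypothetical paths and scores for two videos generated from one prompt.
print(pick_better("a.mp4", 0.375, "b.mp4", 0.125))  # a.mp4
```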

Demos of VisionReward

Citation

@misc{xu2024visionrewardfinegrainedmultidimensionalhuman,
      title={VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation}, 
      author={Jiazheng Xu and Yu Huang and Jiale Cheng and Yuanming Yang and Jiajun Xu and Yuan Wang and Wenbo Duan and Shen Yang and Qunlin Jin and Shurun Li and Jiayan Teng and Zhuoyi Yang and Wendi Zheng and Xiao Liu and Ming Ding and Xiaohan Zhang and Xiaotao Gu and Shiyu Huang and Minlie Huang and Jie Tang and Yuxiao Dong},
      year={2024},
      eprint={2412.21059},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.21059}, 
}
