Code to run inference on video quality with my own video data #3

Thanks for such a project. I can't wait to test the power of your model, which is claimed to be excellent! But it seems "python ./llava/eval/model_score_UGC.py" does not work, and I don't know how to get the model weight file. The model path in your code is "/DATA/DATA2/jzh/video_benchmark/LLaVA-NeXT-main/llava-ov-chat-qwen2_slowfast_base", which I cannot download.

Comments
Change the model path to "q-future/VQA-UGC-Scorer_qwen" and make sure your network can reach Hugging Face (otherwise, use the mirror); the model will then be automatically downloaded and loaded.
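If the automatic download is inconvenient, here is a minimal sketch (not from the repo) of pre-fetching the weights with the standard huggingface_hub API; the mirror endpoint matches the one used later in this thread, and the local folder name is only a suggestion:

```python
# Sketch (not part of the repo): pre-download the released weights so that
# load_pretrained_model can find them locally. HF_ENDPOINT points at the mirror
# for networks that cannot reach huggingface.co directly; it must be set before
# importing huggingface_hub.
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # optional mirror

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="q-future/VQA-UGC-Scorer_qwen",
    local_dir="./weights/llava-VQA-UGC-Scorer_qwen",  # keeping "llava" in the folder name matters, see below
)
print("weights downloaded to:", local_dir)
```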
I tried to make a simple demo.py to test on my own video data, but it seems the current codebase couldn't run at all.
Sorry, I overlooked one crucial point in my code: the name of the model weight folder must contain the string "llava" (see line 279 in llava\model\builder.py). Please download the latest version of the model from Hugging Face, or simply rename your model weight folder.
Thanks for the reply. Can I replace the condition 'if "llava" in model_name.lower() or is_multimodal:' with 'if "VQA-UGC-Scorer" in model_name:' for a simple test?
Sure, that's your decision.
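For reference, a rough sketch of the check being discussed; the condition itself is the one quoted from llava\model\builder.py (around line 279), while the helper function and folder names around it are illustrative assumptions rather than the repo's actual code:

```python
# Illustrative sketch (not the real builder.py): the loader only takes the
# LLaVA/multimodal loading path when the weight folder's name contains "llava".
import os

def uses_llava_loading_path(model_path: str, is_multimodal: bool = False) -> bool:
    model_name = os.path.basename(model_path.rstrip("/"))
    # Condition quoted from llava/model/builder.py (around line 279):
    if "llava" in model_name.lower() or is_multimodal:
        return True
    return False

# Renaming the weight folder avoids editing builder.py at all:
print(uses_llava_loading_path("./weights/VQA-UGC-Scorer_qwen"))        # False -> wrong loading path
print(uses_llava_loading_path("./weights/llava-VQA-UGC-Scorer_qwen"))  # True  -> expected path
```

Either renaming the folder or loosening the condition, as asked above, should have the same effect for a quick test.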
Thanks, that fixed the previous bug, but another one was triggered. It seems the codebase may still be under active development; I hope it becomes easy for ordinary users to run one day. The traceback is:

```
Traceback (most recent call last):
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 178, in <module>
    eval_model(args)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 155, in eval_model
    output_logits = model(input_ids,
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/language_model/llava_qwen.py", line 84, in forward
    (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/llava_arch.py", line 262, in prepare_inputs_labels_for_multimodal
    encoded_image_features,encoded_slowfast_features = self.encode_images(images)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/llava_arch.py", line 192, in encode_images
    image_features = self.get_model().get_vision_tower()(images[1])
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/multimodal_encoder/siglip_encoder.py", line 586, in forward
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 364, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 175, in send_to_device
    return honor_type(
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 82, in honor_type
    return type(obj)(generator)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 176, in <genexpr>
    tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 156, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

That is probably because your GPU memory is not enough, so the weights have been offloaded to the CPU. What device are you using for the test?
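For anyone hitting the same error, here is a small diagnostic sketch (my own, assuming standard PyTorch and accelerate behavior rather than anything repo-specific) that shows whether load_pretrained_model left weights on the meta device and how much GPU memory is in use:

```python
# Sketch (not from this repo): "Cannot copy out of meta tensor" usually means
# some weights were never materialized on a real device, e.g. because accelerate
# offloaded them when the GPU ran out of memory. This prints where the
# parameters ended up and the current GPU memory usage.
import torch

def report_placement(model) -> None:
    per_device = {}
    for _, p in model.named_parameters():
        per_device[str(p.device)] = per_device.get(str(p.device), 0) + p.numel()
    print("parameters per device:", per_device)  # a "meta" entry means unmaterialized weights
    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory / 2**30
        used = torch.cuda.memory_allocated(0) / 2**30
        print(f"GPU 0: {used:.1f} GiB allocated of {total:.1f} GiB total")
```

Calling report_placement(model) right after load_pretrained_model should reveal a "meta" or CPU entry whenever accelerate could not fit everything on the GPU.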
I tested it on my Mac M1 before, with the CPU as the device. As you suggested, the earlier error went away after I switched to a V100 GPU on an Ubuntu system, but then I ran into more errors. Given the under-development state, VQA and Q-Align might not be suitable for large-scale inference on custom data right now; thanks to the authors all the same. I am posting my rough demo.py below. It is located in the same folder as "./llava/eval/model_score_UGC.py", and I run it with "python ./llava/eval/demo.py --video-path ./0000.mp4 --device cuda". Note that I also modified llava\model\builder.py as the author suggested.

```python
import argparse
import os
import sys

sys.path.append('./')
sys.path.append('../')

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import re
from typing import Dict

import torch
import transformers
from PIL import Image

from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import get_model_name_from_path, KeywordsStoppingCriteria


def load_video(video_file, video_fps):
    """Decode every frame of the video and pick roughly one key-frame index per second."""
    from decord import VideoReader, cpu

    vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
    frames = vr.get_batch(list(range(len(vr)))).asnumpy()

    frame_idx1 = []
    video_fps = 1
    for ii in range(len(vr) // round(vr.get_avg_fps())):
        avg_fps = round(vr.get_avg_fps() / video_fps)
        frame_idx1.extend(range(ii * round(vr.get_avg_fps()), (ii + 1) * round(vr.get_avg_fps()), avg_fps))
    # Cover the leftover frames after the last full second.
    avg_fps = round(vr.get_avg_fps() / video_fps)
    frame_idx1.extend(range((ii + 1) * round(vr.get_avg_fps()), len(vr), avg_fps))

    return [Image.fromarray(frames[i]) for i in range(len(vr))], frame_idx1


def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False,
                    max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
    """Build Qwen-style chat input_ids, inserting IMAGE_TOKEN_INDEX wherever <image> appears."""
    im_start, im_end = tokenizer.additional_special_tokens_ids
    nl_tokens = tokenizer("\n").input_ids
    _system = tokenizer("system").input_ids + nl_tokens

    input_ids, targets = [], []
    source = sources
    input_id, target = [], []
    system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
    input_id += system
    target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
    assert len(input_id) == len(target)

    for j, sentence in enumerate(source):
        role = "<|im_start|>user" if j == 0 else "<|im_start|>assistant"
        if has_image and sentence is not None and "<image>" in sentence:
            num_image = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence))
            texts = sentence.split('<image>')
            _input_id = tokenizer(role).input_ids + nl_tokens
            for i, text in enumerate(texts):
                _input_id += tokenizer(text).input_ids
                if i < len(texts) - 1:
                    _input_id += [IMAGE_TOKEN_INDEX]
            _input_id += [im_end] + nl_tokens
            assert sum([i == IMAGE_TOKEN_INDEX for i in _input_id]) == num_image
        else:
            if sentence["value"] is None:
                _input_id = tokenizer(role).input_ids + nl_tokens
            else:
                _input_id = tokenizer(role).input_ids + nl_tokens + tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
        input_id += _input_id
        if role == "<|im_start|>user":
            _target = [im_start] + [IGNORE_INDEX] * (len(_input_id) - 3) + [im_end] + nl_tokens
        elif role == "<|im_start|>assistant":
            _target = [im_start] + [IGNORE_INDEX] * len(tokenizer(role).input_ids) + _input_id[len(tokenizer(role).input_ids) + 1:-2] + [im_end] + nl_tokens
        else:
            raise NotImplementedError
        target += _target

    input_ids.append(input_id)
    targets.append(target)
    input_ids = torch.tensor(input_ids, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)
    return input_ids


def eval_model(args):
    disable_torch_init()
    model_path = os.path.expanduser(args.model_path)
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, args.model_base, model_name, device_map=args.device)

    inp = ("The key frames of this video are:" + "\n" + DEFAULT_IMAGE_TOKEN
           + ". And the motion feature of the video is" + "\n" + DEFAULT_IMAGE_TOKEN
           + ". How would you rate the quality of this video?")

    print("exists = ", os.path.exists(args.video_path))
    image, frame_idx = load_video(args.video_path, 24)

    cur_prompt = args.extra_prompt + inp
    conv = conv_templates[args.conv_mode].copy()
    conv.append_message(conv.roles[0], cur_prompt)
    conv.append_message(conv.roles[1], "The quality of the video is")
    prompt = conv.get_prompt()

    input_ids = preprocess_qwen(
        [cur_prompt, {'from': 'gpt', 'value': "The quality of the video is"}],
        tokenizer, has_image=True).to(args.device)

    # Two visual inputs: an every-8th-frame clip and the per-second key frames.
    image_tensor = image_processor.preprocess(
        image[:len(image) // 8 * 8][0::8], return_tensors='pt')['pixel_values']
    image_tensor1 = image_processor.preprocess(
        [image[frame_idx[i]] for i in range(len(frame_idx))], return_tensors='pt')['pixel_values']
    image_tensors = [[image_tensor[:image_tensor.shape[0] // 4 * 4].half().to(args.device)],
                     [image_tensor1.half().to(args.device)]]

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

    output_logits = model(input_ids, images=image_tensors)["logits"][:, -3]
    print(output_logits)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="q-future/VQA-UGC-Scorer_qwen")
    parser.add_argument("--video-path", type=str, default="/Users/momiao/Desktop/Projects/video_samples/good/0000.mp4")
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")
    parser.add_argument("--extra-prompt", type=str, default="")
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--test_size", type=int, default=10000000)
    args = parser.parse_args()
    eval_model(args)
```

Could you please tell me what new bug or error you encountered?
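A side note on interpreting print(output_logits) in the demo above: the tensor is vocabulary-sized, so it still has to be mapped to a score. The sketch below shows a Q-Align-style weighted softmax over rating words; the word list, the 5-to-1 weights, and the token handling are my assumptions for illustration, not the repo's confirmed post-processing:

```python
# Hypothetical post-processing sketch (NOT confirmed to be this repo's method):
# read the logits of a few rating words at the answer position and take a
# softmax-weighted average as the quality score.
import torch

def logits_to_score(output_logits, tokenizer):
    rating_words = ["excellent", "good", "fair", "poor", "bad"]    # assumed vocabulary
    weights = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])              # assumed 5..1 scale
    ids = [tokenizer(" " + w).input_ids[-1] for w in rating_words]  # assumes single-token pieces
    probs = torch.softmax(output_logits[0, ids], dim=-1)
    return float((probs * weights).sum())
```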
Additionally, if you are from China, feel free to ask in Chinese to make it easier to resolve the problems.
Hello, could you provide a demo.py based on the Visual-Question-Answering-for-Video-Quality-Assessment project that can evaluate in-the-wild videos? I am new to the video quality assessment field, and heavily modifying your code to get it running would be quite challenging for me. Your other project, Q-Align, seems to have many stars and more attention, and may better fit my use case; however, the demo on its Hugging Face Space does not run, and strictly following the environment setup in the repo and the official code does not work either. Would you consider prioritizing maintenance and updates for the Q-Align project? What I want to do right now is filter low-quality clips out of hundreds of millions of videos, and I would like to actually test the current SOTA models for in-the-wild video in this field. I see that Tencent's Hunyuan text-to-video work used DOVER, which your team built earlier; that model has few enough parameters that I can run it. I originally wanted to see whether Q-Align and Visual-Question-Answering-for-Video-Quality-Assessment perform better, but right now they look quite challenging for a newcomer. I hope to get your advice and help. Many thanks!
We will polish and revise the code as soon as possible and try to make it quick to get started with; thanks for your support. Also, could you share the duration and frame rate of the videos you are evaluating? Normally, scoring a single clip of about ten seconds needs less than 20 GB of GPU memory, so it should not run out of memory.
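To answer that question concretely, here is a small helper (my own sketch, reusing the decord and torch calls already present in the demo above) that reports a clip's duration and frame rate plus the peak GPU memory after a scoring run:

```python
# Sketch (not from the repo): report video duration / frame rate and peak GPU
# memory, to compare against the ~20 GB figure mentioned above.
import torch
from decord import VideoReader, cpu

def report_video_and_memory(video_path: str) -> None:
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    fps = vr.get_avg_fps()
    print(f"{video_path}: {len(vr)} frames, {fps:.2f} fps, {len(vr) / fps:.1f} s")
    if torch.cuda.is_available():
        peak = torch.cuda.max_memory_allocated(0) / 2**30
        print(f"peak GPU memory allocated so far: {peak:.1f} GiB")
```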