Code to run inference on video quality with my own video data #3

Thanks for such a project. I can't wait to test the power of your model, which is claimed to be excellent! But it seems "python ./llava/eval/model_score_UGC.py" does not work, and I don't know how to get the model weight file. The model path in your code is "/DATA/DATA2/jzh/video_benchmark/LLaVA-NeXT-main/llava-ov-chat-qwen2_slowfast_base", which I cannot download.

Comments
Change the model path to "q-future/VQA-UGC-Scorer_qwen" and make sure your network can reach Hugging Face (otherwise, use the mirror); the model will then be automatically downloaded and loaded.
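If the automatic download is inconvenient, here is a minimal sketch (not from the repo) of pre-fetching the weights with the standard huggingface_hub API; the mirror endpoint matches the one used later in this thread, and the local folder name is only a suggestion:

```python
# Sketch (not part of the repo): pre-download the released weights so that
# load_pretrained_model can find them locally. HF_ENDPOINT points at the mirror
# for networks that cannot reach huggingface.co directly; it must be set before
# importing huggingface_hub.
import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # optional mirror

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="q-future/VQA-UGC-Scorer_qwen",
    local_dir="./weights/llava-VQA-UGC-Scorer_qwen",  # keeping "llava" in the folder name matters, see below
)
print("weights downloaded to:", local_dir)
```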
I tried to make a simple demo.py to test on my own video data, but it seems the current codebase couldn't run at all.
Sorry, I overlooked one crucial point in my code: the name of the model weight folder must contain the string "llava" (see line 279 in llava\model\builder.py). Please download the latest version of the model from Hugging Face, or simply rename your model weight folder.
Thanks for the reply. Can I replace the condition 'if "llava" in model_name.lower() or is_multimodal:' with 'if "VQA-UGC-Scorer" in model_name:' for a simple test?
Sure, that's your decision.
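For reference, a rough sketch of the check being discussed; the condition itself is the one quoted from llava\model\builder.py (around line 279), while the helper function and folder names around it are illustrative assumptions rather than the repo's actual code:

```python
# Illustrative sketch (not the real builder.py): the loader only takes the
# LLaVA/multimodal loading path when the weight folder's name contains "llava".
import os

def uses_llava_loading_path(model_path: str, is_multimodal: bool = False) -> bool:
    model_name = os.path.basename(model_path.rstrip("/"))
    # Condition quoted from llava/model/builder.py (around line 279):
    if "llava" in model_name.lower() or is_multimodal:
        return True
    return False

# Renaming the weight folder avoids editing builder.py at all:
print(uses_llava_loading_path("./weights/VQA-UGC-Scorer_qwen"))        # False -> wrong loading path
print(uses_llava_loading_path("./weights/llava-VQA-UGC-Scorer_qwen"))  # True  -> expected path
```

Either renaming the folder or loosening the condition, as asked above, should have the same effect for a quick test.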
Thanks, that fixed the previous bug, but another one was triggered. It seems the codebase may still be under active development; I hope it becomes easy for ordinary users to run one day. The traceback is:

```
Traceback (most recent call last):
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 178, in <module>
    eval_model(args)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 155, in eval_model
    output_logits = model(input_ids,
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/language_model/llava_qwen.py", line 84, in forward
    (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/llava_arch.py", line 262, in prepare_inputs_labels_for_multimodal
    encoded_image_features,encoded_slowfast_features = self.encode_images(images)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/llava_arch.py", line 192, in encode_images
    image_features = self.get_model().get_vision_tower()(images[1])
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/multimodal_encoder/siglip_encoder.py", line 586, in forward
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 364, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 175, in send_to_device
    return honor_type(
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 82, in honor_type
    return type(obj)(generator)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 176, in <genexpr>
    tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 156, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

That is probably because your GPU memory is not enough, so the weights have been offloaded to the CPU. What device are you using for the test?
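For anyone hitting the same error, here is a small diagnostic sketch (my own, assuming standard PyTorch and accelerate behavior rather than anything repo-specific) that shows whether load_pretrained_model left weights on the meta device and how much GPU memory is in use:

```python
# Sketch (not from this repo): "Cannot copy out of meta tensor" usually means
# some weights were never materialized on a real device, e.g. because accelerate
# offloaded them when the GPU ran out of memory. This prints where the
# parameters ended up and the current GPU memory usage.
import torch

def report_placement(model) -> None:
    per_device = {}
    for _, p in model.named_parameters():
        per_device[str(p.device)] = per_device.get(str(p.device), 0) + p.numel()
    print("parameters per device:", per_device)  # a "meta" entry means unmaterialized weights
    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory / 2**30
        used = torch.cuda.memory_allocated(0) / 2**30
        print(f"GPU 0: {used:.1f} GiB allocated of {total:.1f} GiB total")
```

Calling report_placement(model) right after load_pretrained_model should reveal a "meta" or CPU entry whenever accelerate could not fit everything on the GPU.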
I tested it on my Mac M1 before, with the CPU as the device. As you suggested, the earlier error went away after I switched to a V100 GPU on an Ubuntu system, but then I ran into more errors. Given the under-development state, VQA and Q-Align might not be suitable for large-scale inference on custom data right now; thanks to the authors all the same. I am posting my rough demo.py below. It is located in the same folder as "./llava/eval/model_score_UGC.py", and I run it with "python ./llava/eval/demo.py --video-path ./0000.mp4 --device cuda". Note that I also modified llava\model\builder.py as the author suggested.

```python
import argparse
import os
import sys

sys.path.append('./')
sys.path.append('../')

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

import re
from typing import Dict

import torch
import transformers
from PIL import Image

from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import get_model_name_from_path, KeywordsStoppingCriteria


def load_video(video_file, video_fps):
    """Decode every frame of the video and pick roughly one key-frame index per second."""
    from decord import VideoReader, cpu

    vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
    frames = vr.get_batch(list(range(len(vr)))).asnumpy()

    frame_idx1 = []
    video_fps = 1
    for ii in range(len(vr) // round(vr.get_avg_fps())):
        avg_fps = round(vr.get_avg_fps() / video_fps)
        frame_idx1.extend(range(ii * round(vr.get_avg_fps()), (ii + 1) * round(vr.get_avg_fps()), avg_fps))
    # Cover the leftover frames after the last full second.
    avg_fps = round(vr.get_avg_fps() / video_fps)
    frame_idx1.extend(range((ii + 1) * round(vr.get_avg_fps()), len(vr), avg_fps))

    return [Image.fromarray(frames[i]) for i in range(len(vr))], frame_idx1


def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False,
                    max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
    """Build Qwen-style chat input_ids, inserting IMAGE_TOKEN_INDEX wherever <image> appears."""
    im_start, im_end = tokenizer.additional_special_tokens_ids
    nl_tokens = tokenizer("\n").input_ids
    _system = tokenizer("system").input_ids + nl_tokens

    input_ids, targets = [], []
    source = sources
    input_id, target = [], []
    system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
    input_id += system
    target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
    assert len(input_id) == len(target)

    for j, sentence in enumerate(source):
        role = "<|im_start|>user" if j == 0 else "<|im_start|>assistant"
        if has_image and sentence is not None and "<image>" in sentence:
            num_image = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence))
            texts = sentence.split('<image>')
            _input_id = tokenizer(role).input_ids + nl_tokens
            for i, text in enumerate(texts):
                _input_id += tokenizer(text).input_ids
                if i < len(texts) - 1:
                    _input_id += [IMAGE_TOKEN_INDEX]
            _input_id += [im_end] + nl_tokens
            assert sum([i == IMAGE_TOKEN_INDEX for i in _input_id]) == num_image
        else:
            if sentence["value"] is None:
                _input_id = tokenizer(role).input_ids + nl_tokens
            else:
                _input_id = tokenizer(role).input_ids + nl_tokens + tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
        input_id += _input_id
        if role == "<|im_start|>user":
            _target = [im_start] + [IGNORE_INDEX] * (len(_input_id) - 3) + [im_end] + nl_tokens
        elif role == "<|im_start|>assistant":
            _target = [im_start] + [IGNORE_INDEX] * len(tokenizer(role).input_ids) + _input_id[len(tokenizer(role).input_ids) + 1:-2] + [im_end] + nl_tokens
        else:
            raise NotImplementedError
        target += _target

    input_ids.append(input_id)
    targets.append(target)
    input_ids = torch.tensor(input_ids, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)
    return input_ids


def eval_model(args):
    disable_torch_init()
    model_path = os.path.expanduser(args.model_path)
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, args.model_base, model_name, device_map=args.device)

    inp = ("The key frames of this video are:" + "\n" + DEFAULT_IMAGE_TOKEN
           + ". And the motion feature of the video is" + "\n" + DEFAULT_IMAGE_TOKEN
           + ". How would you rate the quality of this video?")

    print("exists = ", os.path.exists(args.video_path))
    image, frame_idx = load_video(args.video_path, 24)

    cur_prompt = args.extra_prompt + inp
    conv = conv_templates[args.conv_mode].copy()
    conv.append_message(conv.roles[0], cur_prompt)
    conv.append_message(conv.roles[1], "The quality of the video is")
    prompt = conv.get_prompt()

    input_ids = preprocess_qwen(
        [cur_prompt, {'from': 'gpt', 'value': "The quality of the video is"}],
        tokenizer, has_image=True).to(args.device)

    # Two visual inputs: an every-8th-frame clip and the per-second key frames.
    image_tensor = image_processor.preprocess(
        image[:len(image) // 8 * 8][0::8], return_tensors='pt')['pixel_values']
    image_tensor1 = image_processor.preprocess(
        [image[frame_idx[i]] for i in range(len(frame_idx))], return_tensors='pt')['pixel_values']
    image_tensors = [[image_tensor[:image_tensor.shape[0] // 4 * 4].half().to(args.device)],
                     [image_tensor1.half().to(args.device)]]

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    stopping_criteria = KeywordsStoppingCriteria([stop_str], tokenizer, input_ids)

    output_logits = model(input_ids, images=image_tensors)["logits"][:, -3]
    print(output_logits)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="q-future/VQA-UGC-Scorer_qwen")
    parser.add_argument("--video-path", type=str, default="/Users/momiao/Desktop/Projects/video_samples/good/0000.mp4")
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")
    parser.add_argument("--extra-prompt", type=str, default="")
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--test_size", type=int, default=10000000)
    args = parser.parse_args()
    eval_model(args)
```

Could you please tell me what new bug or error you encountered?
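A side note on interpreting print(output_logits) in the demo above: the tensor is vocabulary-sized, so it still has to be mapped to a score. The sketch below shows a Q-Align-style weighted softmax over rating words; the word list, the 5-to-1 weights, and the token handling are my assumptions for illustration, not the repo's confirmed post-processing:

```python
# Hypothetical post-processing sketch (NOT confirmed to be this repo's method):
# read the logits of a few rating words at the answer position and take a
# softmax-weighted average as the quality score.
import torch

def logits_to_score(output_logits, tokenizer):
    rating_words = ["excellent", "good", "fair", "poor", "bad"]    # assumed vocabulary
    weights = torch.tensor([5.0, 4.0, 3.0, 2.0, 1.0])              # assumed 5..1 scale
    ids = [tokenizer(" " + w).input_ids[-1] for w in rating_words]  # assumes single-token pieces
    probs = torch.softmax(output_logits[0, ids], dim=-1)
    return float((probs * weights).sum())
```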
Additionally, if you are from China, feel free to ask in Chinese to make it easier to resolve the problems.
Hello, could you provide a demo.py based on the Visual-Question-Answering-for-Video-Quality-Assessment project that can evaluate in-the-wild videos? I am new to the video quality assessment field, and heavily modifying your code to get it running would be quite challenging for me. Your other project, Q-Align, seems to have many stars and more attention, and may better fit my use case; however, the demo on its Hugging Face Space does not run, and strictly following the environment setup in the repo and the official code does not work either. Would you consider prioritizing maintenance and updates for the Q-Align project? What I want to do right now is filter low-quality clips out of hundreds of millions of videos, and I would like to actually test the current SOTA models for in-the-wild video in this field. I see that Tencent's Hunyuan text-to-video work used DOVER, which your team built earlier; that model has few enough parameters that I can run it. I originally wanted to see whether Q-Align and Visual-Question-Answering-for-Video-Quality-Assessment perform better, but right now they look quite challenging for a newcomer. I hope to get your advice and help. Many thanks!
We will polish and revise the code as soon as possible and try to make it quick to get started with; thanks for your support. Also, could you share the duration and frame rate of the videos you are evaluating? Normally, scoring a single clip of about ten seconds needs less than 20 GB of GPU memory, so it should not run out of memory.
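To answer that question concretely, here is a small helper (my own sketch, reusing the decord and torch calls already present in the demo above) that reports a clip's duration and frame rate plus the peak GPU memory after a scoring run:

```python
# Sketch (not from the repo): report video duration / frame rate and peak GPU
# memory, to compare against the ~20 GB figure mentioned above.
import torch
from decord import VideoReader, cpu

def report_video_and_memory(video_path: str) -> None:
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    fps = vr.get_avg_fps()
    print(f"{video_path}: {len(vr)} frames, {fps:.2f} fps, {len(vr) / fps:.1f} s")
    if torch.cuda.is_available():
        peak = torch.cuda.max_memory_allocated(0) / 2**30
        print(f"peak GPU memory allocated so far: {peak:.1f} GiB")
```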