
Code to run inference on video quality with my own video data #3

Open · MengHao666 opened this issue Dec 12, 2024 · 12 comments

@MengHao666

Thanks for such a great project.
I can't wait to test the power of your model, which looks excellent!
However, "python ./llava/eval/model_score_UGC.py" does not work for me, and I don't know how to get the model checkpoint file.
The model path in your code is "/DATA/DATA2/jzh/video_benchmark/LLaVA-NeXT-main/llava-ov-chat-qwen2_slowfast_base", which I cannot download.

@jzhws (Collaborator) commented Dec 12, 2024

Change the path to "q-future/VQA-UGC-Scorer_qwen" and make sure your network can reach Hugging Face (otherwise, use the mirror); the model will then be automatically downloaded and loaded.
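
A minimal sketch of what that looks like, assuming the same load_pretrained_model call used in the scripts in this thread (the mirror endpoint is only needed if huggingface.co is unreachable):

# Hedged sketch: point the loader at the Hub repo id so the weights are pulled automatically.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # optional mirror; set before any Hugging Face import

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "q-future/VQA-UGC-Scorer_qwen"   # Hub repo id instead of the local /DATA/... path
model_name = get_model_name_from_path(model_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, model_name)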

@MengHao666 (Author)

I tried to write a simple demo.py to test on my own video data. However, it seems the current codebase cannot run at all.

The demo.py:

import argparse

import os
import sys
sys.path.append('./')
sys.path.append('../')
import json
from tqdm import tqdm
import numpy as np
import shortuuid
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
os.environ['HF_ENDPOINT']= 'https://hf-mirror.com'
import torch
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from torchvision import transforms
from llava.constants import IGNORE_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX
from typing import Dict, Optional, Sequence, List
import transformers
import re
from collections import defaultdict
from PIL import Image
import math
from scipy.stats import spearmanr, pearsonr

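# Decode every frame of the video with decord, build key-frame indices at roughly
# 1 fps (the video_fps argument is overridden to 1 below), and return all frames
# as PIL Images together with those indices.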
def load_video(video_file, video_fps):
    from decord import VideoReader,cpu
    vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
    # frame_idx=[]
    # for ii in range(len(vr)//round(vr.get_avg_fps())):
    #     total_frame_num = round(vr.get_avg_fps())
    #     avg_fps = round(vr.get_avg_fps() / video_fps)
    #     # total_frame_num=len(vr)//avg_fps*avg_fps
    #     frame_idx.extend([i for i in range(ii*round(vr.get_avg_fps()), (ii+1)*round(vr.get_avg_fps()), avg_fps)])
    # total_frame_num = len(vr)-(len(vr)//round(vr.get_avg_fps())*round(vr.get_avg_fps()))
    # avg_fps = round(vr.get_avg_fps() / video_fps)
    # # total_frame_num=len(vr)//avg_fps*avg_fps
    # frame_idx.extend([i for i in range((ii+1)*round(vr.get_avg_fps()), len(vr), avg_fps)])
    # if len(frame_idx) > 200:
    #             uniform_sampled_frames = np.linspace(0, total_frame_num - 1, 100, dtype=int)
    #             frame_idx = uniform_sampled_frames.tolist()
    frames = vr.get_batch(list(range(len(vr)))).asnumpy()
    frame_idx1 = []
    video_fps=1
    for ii in range(len(vr)//round(vr.get_avg_fps())):
        # print(video_file)
        total_frame_num = round(vr.get_avg_fps())
        avg_fps = round(vr.get_avg_fps() / video_fps)
        # total_frame_num=len(vr)//avg_fps*avg_fps
        frame_idx1.extend([i for i in range(ii*round(vr.get_avg_fps()), (ii+1)*round(vr.get_avg_fps()), avg_fps)])
    total_frame_num = len(vr)-(len(vr)//round(vr.get_avg_fps())*round(vr.get_avg_fps()))
    avg_fps = round(vr.get_avg_fps() / video_fps)
    # total_frame_num=len(vr)//avg_fps*avg_fps
    frame_idx1.extend([i for i in range((ii+1)*round(vr.get_avg_fps()), len(vr), avg_fps)])
    # if len(frame_idx) > 200:
    #             uniform_sampled_frames = np.linspace(0, total_frame_num - 1, 100, dtype=int)
    #             frame_idx = uniform_sampled_frames.tolist()

    return [Image.fromarray(frames[i]) for i in range(len((vr)))],frame_idx1
    # return frame_idx,len(frame_idx)/video_fps


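# Build Qwen chat-format token ids for the prompt, inserting IMAGE_TOKEN_INDEX as a
# placeholder wherever <image> appears; only input_ids are returned (targets are
# computed but unused here).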
def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
    roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}

    im_start, im_end = tokenizer.additional_special_tokens_ids
    nl_tokens = tokenizer("\n").input_ids
    _system = tokenizer("system").input_ids + nl_tokens
    _user = tokenizer("user").input_ids + nl_tokens
    _assistant = tokenizer("assistant").input_ids + nl_tokens

    # Apply prompt templates
    input_ids, targets = [], []

    source = sources
    # if roles[source[0]["from"]] != roles["human"]:
    #     source = source[1:]

    input_id, target = [], []
    system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
    input_id += system
    target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
    assert len(input_id) == len(target)
    for j, sentence in enumerate(source):
        if j==0:
            role = "<|im_start|>user"
        else:
            role = "<|im_start|>assistant"
        if has_image and sentence is not None and "<image>" in sentence:
            num_image = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence))
            texts = sentence.split('<image>')
            _input_id = tokenizer(role).input_ids + nl_tokens
            for i,text in enumerate(texts):
                _input_id += tokenizer(text).input_ids
                if i<len(texts)-1:
                    _input_id += [IMAGE_TOKEN_INDEX]
            _input_id += [im_end] + nl_tokens
            assert sum([i==IMAGE_TOKEN_INDEX for i in _input_id])==num_image
        else:
            if sentence["value"] is None:
                _input_id = tokenizer(role).input_ids + nl_tokens
            else:
                _input_id = tokenizer(role).input_ids + nl_tokens + tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
        input_id += _input_id
        if role == "<|im_start|>user":
            _target = [im_start] + [IGNORE_INDEX] * (len(_input_id) - 3) + [im_end] + nl_tokens
        elif role == "<|im_start|>assistant":
            _target = [im_start] + [IGNORE_INDEX] * len(tokenizer(role).input_ids) + _input_id[len(tokenizer(role).input_ids) + 1 : -2] + [im_end] + nl_tokens
        else:
            raise NotImplementedError
        target += _target

    input_ids.append(input_id)
    targets.append(target)
    input_ids = torch.tensor(input_ids, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)
    return input_ids


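# Load the pretrained scorer, sample the video, build the prompt token ids, and
# preprocess two frame stacks (every 8th frame, and the ~1 fps key frames chosen by
# load_video); the forward pass below is still commented out in this version.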
def eval_model(args):

    # Model
    disable_torch_init()
    model_path = os.path.expanduser(args.model_path)
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name)

    inp = "The key frames of this video are:" + "\n" + DEFAULT_IMAGE_TOKEN + ". And the motion feature of the video is" + "\n" + DEFAULT_IMAGE_TOKEN + ". How would you rate the quality of this video?"
    image, frame_idx = load_video(args.video_path, 24)
    cur_prompt = args.extra_prompt + inp
    # print(cur_prompt)
    conv = conv_templates[args.conv_mode].copy()
    conv.append_message(conv.roles[0], cur_prompt)
    conv.append_message(conv.roles[1], "The quality of the video is")
    prompt = conv.get_prompt()

    input_ids = preprocess_qwen([cur_prompt, {'from': 'gpt', 'value': "The quality of the video is"}], tokenizer,
                                has_image=True).to(args.device)

    image_tensor = image_processor.preprocess(image[:len(image) // 8 * 8][0::8], return_tensors='pt')[
        'pixel_values']
    image_tensor1 = \
        image_processor.preprocess([image[frame_idx[i]] for i in range(len(frame_idx))], return_tensors='pt')[
            'pixel_values']

    image_tensors = [[image_tensor[:image_tensor.shape[0] // 4 * 4].half().to(args.device)], [image_tensor1.half().to(args.device)]]

    # stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    # keywords = [stop_str]
    # stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    # output_logits = model(input_ids,
    #                       images=image_tensors)["logits"][:, -3]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="q-future/VQA-UGC-Scorer_qwen")
    parser.add_argument("--video-path", type=str, default="/Users/momiao/Desktop/Projects/video_samples/good/0000.mp4")
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")
    parser.add_argument("--extra-prompt", type=str, default="")
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--test_size", type=int, default=10000000)
    args = parser.parse_args()

    eval_model(args)

The error is as follows:

Traceback (most recent call last):
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 176, in <module>
    eval_model(args)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 143, in eval_model
    image_tensor = image_processor.preprocess(image[:len(image) // 8 * 8][0::8], return_tensors='pt')[
AttributeError: 'NoneType' object has no attribute 'preprocess'

@jzhws (Collaborator) commented Dec 16, 2024

Sorry, I overlooked one crucial point in my code: the name of the model weight folder has to contain the string "llava" (see line 279 in llava\model\builder.py). Please download the latest version of the model from Hugging Face, or simply rename your model weight folder.
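
To make the failure mode concrete, a small illustration of that gate (the condition is the one quoted in the follow-up below; the surrounding prints are only mine, for illustration):

# Illustration of the check in llava/model/builder.py (around line 279): the multimodal
# loading path is taken only when the weight folder / repo name contains "llava".
model_name = "VQA-UGC-Scorer_qwen"   # a local folder named like this skips the branch
is_multimodal = False                # hypothetical flag, shown only to keep the snippet runnable
if "llava" in model_name.lower() or is_multimodal:
    print("multimodal (LLaVA-style) loading: the image_processor is created")
else:
    print("plain LM loading: image_processor stays None -> the AttributeError above")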

@MengHao666 (Author)

> Sorry, I overlooked one crucial point in my code: the name of the model weight folder has to contain the string "llava" (see line 279 in llava\model\builder.py). Please download the latest version of the model from Hugging Face, or simply rename your model weight folder.

Thanks for the reply.
Can I replace "if "llava" in model_name.lower() or is_multimodal:" with "if "VQA-UGC-Scorer" in model_name:" as a quick test?

@jzhws (Collaborator) commented Dec 16, 2024 via email

@MengHao666 (Author)

Thanks, that fixed the previous bug. However, another error was triggered. It seems the current codebase is still under active development; I hope it will become easy for ordinary users to run one day.

Traceback (most recent call last):
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 178, in <module>
    eval_model(args)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/eval/demo.py", line 155, in eval_model
    output_logits = model(input_ids,
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/language_model/llava_qwen.py", line 84, in forward
    (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/llava_arch.py", line 262, in prepare_inputs_labels_for_multimodal
    encoded_image_features,encoded_slowfast_features = self.encode_images(images)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/llava_arch.py", line 192, in encode_images
    image_features = self.get_model().get_vision_tower()(images[1])
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/Users/momiao/Desktop/Projects/Visual-Question-Answering-for-Video-Quality-Assessment/VQA_main/llava/model/multimodal_encoder/siglip_encoder.py", line 586, in forward
    image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/hooks.py", line 364, in pre_forward
    return send_to_device(args, self.execution_device), send_to_device(
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 175, in send_to_device
    return honor_type(
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 82, in honor_type
    return type(obj)(generator)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 176, in <genexpr>
    tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
  File "/Users/momiao/miniforge3/envs/VQA/lib/python3.10/site-packages/accelerate/utils/operations.py", line 156, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data!
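
For what it's worth, this "meta tensor" error usually means some module weights were never materialized onto a real device during dispatch. The revised demo.py later in this thread works around it by passing an explicit device_map to the loader and running on a CUDA machine; a minimal sketch of that change (same call signature as used in that script):

# Hedged sketch: load everything onto one explicit device instead of letting
# accelerate leave modules on the meta device.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, model_name, device_map="cuda")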

@jzhws (Collaborator) commented Dec 16, 2024 via email

@MengHao666 (Author) commented Dec 16, 2024

I previously tested it on my Mac M1 with cpu as the device. As you suggested, I resolved that earlier error by switching to a V100 GPU on an Ubuntu system. However, I then ran into more errors. Given their work-in-progress state, this VQA project and Q-Align may not yet be suitable for large-scale inference on custom data.

Thanks to the authors all the same. I am posting my rough code below; it lives in the same folder as "./llava/eval/model_score_UGC.py", and I run it with "python ./llava/eval/demo.py --video-path ./0000.mp4 --device cuda". The demo.py is as follows. Note that I also modified llava\model\builder.py as the author suggested.

import argparse

import os
import sys
sys.path.append('./')
sys.path.append('../')
import json
from tqdm import tqdm
import numpy as np
import shortuuid
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
os.environ['HF_ENDPOINT']= 'https://hf-mirror.com'
import torch
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from torchvision import transforms
from llava.constants import IGNORE_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX
from typing import Dict, Optional, Sequence, List
import transformers
import re
from collections import defaultdict
from PIL import Image
import math
from scipy.stats import spearmanr, pearsonr

def load_video(video_file, video_fps):
    from decord import VideoReader,cpu
    vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
    # frame_idx=[]
    # for ii in range(len(vr)//round(vr.get_avg_fps())):
    #     total_frame_num = round(vr.get_avg_fps())
    #     avg_fps = round(vr.get_avg_fps() / video_fps)
    #     # total_frame_num=len(vr)//avg_fps*avg_fps
    #     frame_idx.extend([i for i in range(ii*round(vr.get_avg_fps()), (ii+1)*round(vr.get_avg_fps()), avg_fps)])
    # total_frame_num = len(vr)-(len(vr)//round(vr.get_avg_fps())*round(vr.get_avg_fps()))
    # avg_fps = round(vr.get_avg_fps() / video_fps)
    # # total_frame_num=len(vr)//avg_fps*avg_fps
    # frame_idx.extend([i for i in range((ii+1)*round(vr.get_avg_fps()), len(vr), avg_fps)])
    # if len(frame_idx) > 200:
    #             uniform_sampled_frames = np.linspace(0, total_frame_num - 1, 100, dtype=int)
    #             frame_idx = uniform_sampled_frames.tolist()
    frames = vr.get_batch(list(range(len(vr)))).asnumpy()
    frame_idx1 = []
    video_fps=1
    for ii in range(len(vr)//round(vr.get_avg_fps())):
        # print(video_file)
        total_frame_num = round(vr.get_avg_fps())
        avg_fps = round(vr.get_avg_fps() / video_fps)
        # total_frame_num=len(vr)//avg_fps*avg_fps
        frame_idx1.extend([i for i in range(ii*round(vr.get_avg_fps()), (ii+1)*round(vr.get_avg_fps()), avg_fps)])
    total_frame_num = len(vr)-(len(vr)//round(vr.get_avg_fps())*round(vr.get_avg_fps()))
    avg_fps = round(vr.get_avg_fps() / video_fps)
    # total_frame_num=len(vr)//avg_fps*avg_fps
    frame_idx1.extend([i for i in range((ii+1)*round(vr.get_avg_fps()), len(vr), avg_fps)])
    # if len(frame_idx) > 200:
    #             uniform_sampled_frames = np.linspace(0, total_frame_num - 1, 100, dtype=int)
    #             frame_idx = uniform_sampled_frames.tolist()

    return [Image.fromarray(frames[i]) for i in range(len((vr)))],frame_idx1
    # return frame_idx,len(frame_idx)/video_fps


def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
    roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}

    im_start, im_end = tokenizer.additional_special_tokens_ids
    nl_tokens = tokenizer("\n").input_ids
    _system = tokenizer("system").input_ids + nl_tokens
    _user = tokenizer("user").input_ids + nl_tokens
    _assistant = tokenizer("assistant").input_ids + nl_tokens

    # Apply prompt templates
    input_ids, targets = [], []

    source = sources
    # if roles[source[0]["from"]] != roles["human"]:
    #     source = source[1:]

    input_id, target = [], []
    system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
    input_id += system
    target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
    assert len(input_id) == len(target)
    for j, sentence in enumerate(source):
        if j==0:
            role = "<|im_start|>user"
        else:
            role = "<|im_start|>assistant"
        if has_image and sentence is not None and "<image>" in sentence:
            num_image = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence))
            texts = sentence.split('<image>')
            _input_id = tokenizer(role).input_ids + nl_tokens
            for i,text in enumerate(texts):
                _input_id += tokenizer(text).input_ids
                if i<len(texts)-1:
                    _input_id += [IMAGE_TOKEN_INDEX]
            _input_id += [im_end] + nl_tokens
            assert sum([i==IMAGE_TOKEN_INDEX for i in _input_id])==num_image
        else:
            if sentence["value"] is None:
                _input_id = tokenizer(role).input_ids + nl_tokens
            else:
                _input_id = tokenizer(role).input_ids + nl_tokens + tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
        input_id += _input_id
        if role == "<|im_start|>user":
            _target = [im_start] + [IGNORE_INDEX] * (len(_input_id) - 3) + [im_end] + nl_tokens
        elif role == "<|im_start|>assistant":
            _target = [im_start] + [IGNORE_INDEX] * len(tokenizer(role).input_ids) + _input_id[len(tokenizer(role).input_ids) + 1 : -2] + [im_end] + nl_tokens
        else:
            raise NotImplementedError
        target += _target

    input_ids.append(input_id)
    targets.append(target)
    input_ids = torch.tensor(input_ids, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)
    return input_ids


def eval_model(args):

    # Model
    disable_torch_init()
    model_path = os.path.expanduser(args.model_path)
    model_name = get_model_name_from_path(model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name, device_map=args.device)

    inp = "The key frames of this video are:" + "\n" + DEFAULT_IMAGE_TOKEN + ". And the motion feature of the video is" + "\n" + DEFAULT_IMAGE_TOKEN + ". How would you rate the quality of this video?"
    print("exists = ", os.path.exists(args.video_path))
    image, frame_idx = load_video(args.video_path, 24)
    cur_prompt = args.extra_prompt + inp
    # print(cur_prompt)
    conv = conv_templates[args.conv_mode].copy()
    conv.append_message(conv.roles[0], cur_prompt)
    conv.append_message(conv.roles[1], "The quality of the video is")
    prompt = conv.get_prompt()

    input_ids = preprocess_qwen([cur_prompt, {'from': 'gpt', 'value': "The quality of the video is"}], tokenizer,
                                has_image=True).to(args.device)

    image_tensor = image_processor.preprocess(image[:len(image) // 8 * 8][0::8], return_tensors='pt')[
        'pixel_values']
    image_tensor1 = \
        image_processor.preprocess([image[frame_idx[i]] for i in range(len(frame_idx))], return_tensors='pt')[
            'pixel_values']

    image_tensors = [[image_tensor[:image_tensor.shape[0] // 4 * 4].half().to(args.device)], [image_tensor1.half().to(args.device)]]

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
    output_logits = model(input_ids,
                          images=image_tensors)["logits"][:, -3]
    print(output_logits)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="q-future/VQA-UGC-Scorer_qwen")
    parser.add_argument("--video-path", type=str, default="/Users/momiao/Desktop/Projects/video_samples/good/0000.mp4")
    parser.add_argument("--device", type=str, default="cuda")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-folder", type=str, default="")
    parser.add_argument("--extra-prompt", type=str, default="")
    parser.add_argument("--question-file", type=str, default="tables/question.jsonl")
    parser.add_argument("--answers-file", type=str, default="answer.jsonl")
    parser.add_argument("--conv-mode", type=str, default="llava_v1")
    parser.add_argument("--num-chunks", type=int, default=1)
    parser.add_argument("--chunk-idx", type=int, default=0)
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--test_size", type=int, default=10000000)
    args = parser.parse_args()

    eval_model(args)
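
If anyone wants to turn the printed logits into a scalar score: output_logits is the distribution over the token that would follow "The quality of the video is". A Q-Align-style readout would softmax over a fixed set of rating words and take their weighted average; the word list and 1-5 scale below are my assumptions for illustration, not necessarily what model_score_UGC.py uses:

# Hypothetical post-processing, reusing tokenizer and output_logits from eval_model above.
import torch

rating_words = [" bad", " poor", " fair", " good", " excellent"]  # assumed rating vocabulary
weights = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])                 # assumed 1-5 scale
token_ids = [tokenizer(w).input_ids[-1] for w in rating_words]    # assumes single-token rating words
logits_at_rating = output_logits[0, token_ids].float().cpu()      # gather the rating-word logits
probs = torch.softmax(logits_at_rating, dim=0)
score = (probs * weights).sum().item()
print("predicted quality score:", score)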

@jzhws (Collaborator) commented Dec 16, 2024 via email

@jzhws (Collaborator) commented Dec 16, 2024

Additionally, if you are from China, feel free to ask in Chinese; that will make it easier to resolve the problems.

@MengHao666 (Author) commented Dec 17, 2024

Hello,

Could you provide a demo.py for the Visual-Question-Answering-for-Video-Quality-Assessment project that can evaluate in-the-wild videos? I am new to video quality assessment, and hacking your code until it runs properly is quite challenging for me.
I ran the code above on Ubuntu with 4 V100 GPUs. With cuda it runs out of memory; with cpu it raises "RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'", presumably because the model and data are handled as fp16 internally and would need extra dtype handling to run on CPU.

The other project, Q-Align, has many more stars and more attention, and may better fit my scenario. However, its Hugging Face Space demo does not work, and I could not get it running even when following the repo's environment setup and official code exactly. Would you consider prioritizing maintenance and updates for the Q-Align project?

What I want to do now is filter low-quality clips out of hundreds of millions of videos, so I would like to actually test the current SOTA models for in-the-wild video in this field. Tencent's Hunyuan text-to-video appears to use DOVER, your team's earlier work, which has a small parameter count and runs fine. I had hoped to see whether Q-Align or Visual-Question-Answering-for-Video-Quality-Assessment performs better, but right now both look quite challenging for a newcomer.

I would appreciate your advice and help. Thank you!
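
A note on the CPU error above: the message means there is no half-precision LayerNorm kernel on CPU in that setup, so CPU-only runs would need everything kept in float32. A rough sketch of the change this would imply in demo.py (my assumption, not an official recipe):

# Load on CPU and cast the model to float32.
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, model_name, device_map="cpu")
model = model.float()

# Feed the frame stacks as float32 instead of .half() to match the model dtype.
image_tensors = [[image_tensor[:image_tensor.shape[0] // 4 * 4].float().to("cpu")],
                 [image_tensor1.float().to("cpu")]]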

@jzhws (Collaborator) commented Dec 18, 2024

We will polish and revise the code as soon as possible to make it easier to get started with; thanks for your support. Also, could you share the duration and frame rate of the videos you are evaluating? Normally, scoring a single clip of about ten seconds needs less than 20 GB of GPU memory, so it should not run out of memory.
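
On the memory question: one simple guard (mirroring the commented-out np.linspace sampling already present in load_video, so purely a suggestion) is to cap the number of key frames passed to the model before preprocessing:

# Hypothetical memory guard in demo.py: uniformly subsample the key-frame indices.
import numpy as np

max_frames = 100  # assumed budget; lower it if GPU memory is tight
if len(frame_idx) > max_frames:
    keep = np.linspace(0, len(frame_idx) - 1, max_frames, dtype=int)
    frame_idx = [frame_idx[i] for i in keep]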
