Hello,
As per https://ai.stackexchange.com/questions/40753/how-to-generate-original-training-videos-based-on-existing-videoset , I have built an AI factory that extracts multimodal summaries (derived from transcribed subtitles and keyframe analysis) and e-learning content from a series of vocational training videos in the wheelchair custom seating vertical. Each text object has a set of timestamps identifying the keyframes that contain the corresponding visual data, and each text object is expressed in multiple required languages (language will be my Pinecone namespace).
I'm very new to the RAG/embeddings/vector DB game and I have some questions:
Is it possible to create a single embedding that combines the semantics of a structured object containing a textual description together with a small set of keyframes, or should I create multiple embeddings with identical metadata before upserting into a vector database such as Pinecone? I saw some examples online of averaging multiple embeddings, but I'm concerned that doing this will throw away crucial detail that may be needed later.
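To make the two options concrete, here is a rough sketch of what I mean (dimensions, IDs, and the random vectors are placeholders for whatever the real encoder produces):

```python
import numpy as np

# Placeholder embeddings (e.g. 1024-dim CLIP vectors), assumed L2-normalized.
text_emb = np.random.rand(1024).astype("float32")
frame_embs = [np.random.rand(1024).astype("float32") for _ in range(4)]

# Option A: one combined vector per object -- average the text and keyframe
# embeddings. Simple, but fine-grained visual detail may be washed out.
combined = np.mean([text_emb, *frame_embs], axis=0)
combined /= np.linalg.norm(combined)

# Option B: one vector per modality/keyframe, all carrying identical metadata,
# so a match on any of them resolves back to the same JSON object.
vectors = [("obj42-text", text_emb)] + [
    (f"obj42-frame{i}", emb) for i, emb in enumerate(frame_embs)
]
```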
The goal is for a user query consisting of text and/or an image to drive a semantic search that returns the objects whose embeddings are the top N nearest neighbors.
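In Pinecone terms, I imagine the query step looking roughly like this (v2-style Python client calls; the index name, credentials, and random query vector are placeholders):

```python
import numpy as np
import pinecone

pinecone.init(api_key="...", environment="...")  # credentials elided
index = pinecone.Index("seating-training")       # hypothetical index name

# Placeholder: the real query vector would come from embedding the user's
# text and/or image with the same model used at indexing time.
query_emb = np.random.rand(1024).astype("float32")

results = index.query(
    vector=query_emb.tolist(),
    top_k=10,               # top N nearest neighbors
    namespace="en",         # language namespace
    include_metadata=True,  # each match carries its metadata / JSON reference
)
for match in results.matches:
    print(match.id, match.score, match.metadata)
```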
The metadata upserted along with each embedding is: series name, video name, applicable timestamps, object type, and a reference to the actual JSON object used to create the embedding.
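Concretely, I picture each upserted record looking something like this (all field names and values are illustrative placeholders from my working schema, not anything Pinecone requires):

```python
# One Pinecone record per object (or per modality/keyframe, if I go that route).
record = {
    "id": "series3-video7-obj42",
    "values": [0.0] * 1024,  # placeholder for the real embedding vector
    "metadata": {
        "series_name": "Custom Seating Basics",
        "video_name": "Footrest Fabrication",
        "timestamps": ["00:04:12", "00:05:03"],
        "object_type": "summary",
        "source_object": "objects/obj42.json",  # reference to the source JSON
    },
}
# index.upsert(vectors=[record], namespace="en")  # language namespace
```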
I'm considering using CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.
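If I go with that model, my understanding is that it can be loaded through open_clip roughly as below (the model/pretrained tags are my reading of the LAION model card, so please correct me if they're off):

```python
import torch
import open_clip
from PIL import Image

# Multilingual CLIP: ViT-H-14 image tower, frozen XLM-RoBERTa-large text tower,
# trained on LAION-5B.
model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-large-ViT-H-14", pretrained="frozen_laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-large-ViT-H-14")
model.eval()

with torch.no_grad():
    tokens = tokenizer(["how to fabricate a wheelchair footrest"])
    text_emb = model.encode_text(tokens)

    image = preprocess(Image.open("keyframe_001.jpg")).unsqueeze(0)  # placeholder path
    image_emb = model.encode_image(image)

    # Normalize so cosine similarity reduces to a dot product in the vector DB.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
```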
I'm unsure whether, for example, the embedding of an image of a wheelchair footrest would be close in vector space to the embedding of a text-based object that discusses the specific details of how to build a wheelchair footrest.
Am I on the right track here, or are there any crucial details I'm missing?
As you may have surmised, I'm trying to avoid fine-tuning, mostly because I lack the experience and budget, but also because I want to exhaust the simpler alternatives within reach of a small company like mine and my client's. The risks of catastrophic forgetting and over/underfitting are ones I'm simply not able to take on at present.
Many thanks. Respect to the community!