Hello,
As per https://ai.stackexchange.com/questions/40753/how-to-generate-original-training-videos-based-on-existing-videoset , I have built an AI factory that extracts multimodal summaries (derived from transcribed subtitles and keyframe analysis) and e-learning content from a series of vocational training videos in the wheelchair custom seating vertical. Each text object has a set of timestamps identifying the keyframes that contain the corresponding visual data, and each text object is expressed in multiple required languages (language will be my Pinecone namespace).
I'm very new to the RAG/embeddings/vector DB game and I have some questions:
Is it possible to create a single embedding that combines the semantics of a structured object containing a textual description together with a small set of keyframes, or should I create multiple embeddings with identical metadata before upserting into a vector database such as Pinecone? I saw some examples online of averaging multiple embeddings, but I'm concerned that doing this will throw away crucial detail that may be needed later.
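To make the two options concrete, here is a rough sketch of what I mean (dimensions, IDs, and the random vectors are placeholders for whatever the real encoder produces):

```python
import numpy as np

# Placeholder embeddings (e.g. 1024-dim CLIP vectors), assumed L2-normalized.
text_emb = np.random.rand(1024).astype("float32")
frame_embs = [np.random.rand(1024).astype("float32") for _ in range(4)]

# Option A: one combined vector per object -- average the text and keyframe
# embeddings. Simple, but fine-grained visual detail may be washed out.
combined = np.mean([text_emb, *frame_embs], axis=0)
combined /= np.linalg.norm(combined)

# Option B: one vector per modality/keyframe, all carrying identical metadata,
# so a match on any of them resolves back to the same JSON object.
vectors = [("obj42-text", text_emb)] + [
    (f"obj42-frame{i}", emb) for i, emb in enumerate(frame_embs)
]
```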
The goal is for a user query consisting of text and/or an image to drive a semantic search that returns the objects whose embeddings are the top N nearest neighbors.
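In Pinecone terms, I imagine the query step looking roughly like this (v2-style Python client calls; the index name, credentials, and random query vector are placeholders):

```python
import numpy as np
import pinecone

pinecone.init(api_key="...", environment="...")  # credentials elided
index = pinecone.Index("seating-training")       # hypothetical index name

# Placeholder: the real query vector would come from embedding the user's
# text and/or image with the same model used at indexing time.
query_emb = np.random.rand(1024).astype("float32")

results = index.query(
    vector=query_emb.tolist(),
    top_k=10,               # top N nearest neighbors
    namespace="en",         # language namespace
    include_metadata=True,  # each match carries its metadata / JSON reference
)
for match in results.matches:
    print(match.id, match.score, match.metadata)
```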
The metadata upserted along with each embedding is: series name, video name, applicable timestamps, object type, and a reference to the actual JSON object used to create the embedding.
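Concretely, I picture each upserted record looking something like this (all field names and values are illustrative placeholders from my working schema, not anything Pinecone requires):

```python
# One Pinecone record per object (or per modality/keyframe, if I go that route).
record = {
    "id": "series3-video7-obj42",
    "values": [0.0] * 1024,  # placeholder for the real embedding vector
    "metadata": {
        "series_name": "Custom Seating Basics",
        "video_name": "Footrest Fabrication",
        "timestamps": ["00:04:12", "00:05:03"],
        "object_type": "summary",
        "source_object": "objects/obj42.json",  # reference to the source JSON
    },
}
# index.upsert(vectors=[record], namespace="en")  # language namespace
```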
I'm considering using CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k.
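If I go with that model, my understanding is that it can be loaded through open_clip roughly as below (the model/pretrained tags are my reading of the LAION model card, so please correct me if they're off):

```python
import torch
import open_clip
from PIL import Image

# Multilingual CLIP: ViT-H-14 image tower, frozen XLM-RoBERTa-large text tower,
# trained on LAION-5B.
model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-large-ViT-H-14", pretrained="frozen_laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-large-ViT-H-14")
model.eval()

with torch.no_grad():
    tokens = tokenizer(["how to fabricate a wheelchair footrest"])
    text_emb = model.encode_text(tokens)

    image = preprocess(Image.open("keyframe_001.jpg")).unsqueeze(0)  # placeholder path
    image_emb = model.encode_image(image)

    # Normalize so cosine similarity reduces to a dot product in the vector DB.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
```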
I'm unsure whether, for example, the embedding of an image of a wheelchair footrest would be close in vector space to the embedding of a text-based object that discusses the specific details of how to build a wheelchair footrest.
Am I on the right track here, or are there any crucial details I'm missing?
As you may have surmised, I'm trying to avoid fine-tuning, mostly because I lack the experience and budget, but also because I want to exhaust the simpler alternatives within reach of a small company like mine and my client's. The risks of catastrophic forgetting and over/underfitting are ones I'm simply not able to take on at present.
Many thanks. Respect to the community!