PR: add audio data input

feat: add audio data input
NotShrirang authored Jan 4, 2025
2 parents a2c1b5a + 119ea0b, commit c2bff8e

Showing 9 changed files with 147 additions and 27 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -170,6 +170,7 @@ cython_debug/
# PyPI configuration file
.pypirc

audio/
images/
vectorstore/
trial.py
35 changes: 20 additions & 15 deletions README.md
@@ -10,7 +10,7 @@
![GitHub repo size](https://img.shields.io/github/repo-size/NotShrirang/LoomRAG)
<a href="https://huggingface.co/spaces/NotShrirang/LoomRAG"><img src="https://img.shields.io/badge/Streamlit%20App-red?style=flat-rounded-square&logo=streamlit&labelColor=white"/></a>

This project implements a Multimodal Retrieval-Augmented Generation (RAG) system, named **LoomRAG**, that leverages OpenAI's CLIP model for neural cross-modal retrieval and semantic search. The system allows users to input text queries and retrieve both text and image responses seamlessly through vector embeddings. It features a comprehensive annotation interface for creating custom datasets and supports CLIP model fine-tuning with configurable parameters for domain-specific applications. The system also supports uploading images and PDFs for enhanced interaction and intelligent retrieval capabilities through a Streamlit-based interface.
This project implements a Multimodal Retrieval-Augmented Generation (RAG) system, named **LoomRAG**, that leverages **OpenAI's CLIP** model for neural cross-modal image retrieval and semantic search, and **OpenAI's Whisper** model for audio processing. The system allows users to input text queries, images, or audio to retrieve multimodal responses seamlessly through vector embeddings. It features a comprehensive annotation interface for creating custom datasets and supports CLIP model fine-tuning with configurable parameters for domain-specific applications. The system also supports uploading images, PDFs, and audio files (including real-time recording) for enhanced interaction and intelligent retrieval capabilities through a Streamlit-based interface.

Experience the project in action:

@@ -33,10 +33,11 @@ Experience the project in action:

- 🔄 **Cross-Modal Retrieval**: Search text to retrieve both text and image results using deep learning
- 🖼️ **Image-Based Search**: Search the database by uploading an image to find similar content
- 🧠 **Embedding-Based Search**: Uses OpenAI's CLIP model to align text and image embeddings in a shared latent space
- 🧠 **Embedding-Based Search**: Uses OpenAI's CLIP and Whisper models together with SentenceTransformer embedding models to embed the input data
- 🎯 **CLIP Fine-Tuning**: Supports custom model training with configurable parameters including test dataset split size, learning rate, optimizer, and weight decay
- 🔨 **Fine-Tuned Model Integration**: Seamlessly load and utilize fine-tuned CLIP models for enhanced search and retrieval
- 📤 **Upload Options**: Allows users to upload images and PDFs for AI-powered processing and retrieval
- 📤 **Upload Options**: Allows users to upload images, PDFs, and audio files for AI-powered processing and retrieval
- 🎙️ **Audio Integration**: Upload audio files or record audio directly through the interface
- 🔗 **URL Integration**: Add images directly using URLs and scrape website data including text and images
- 🕷️ **Web Scraping**: Automatically extract and index content from websites for comprehensive search capabilities
- 🏷️ **Image Annotation**: Enables users to annotate uploaded images through an intuitive interface
@@ -45,6 +46,17 @@ Experience the project in action:

---

## 🗺️ Roadmap

- [x] Fine-tuning CLIP for domain-specific datasets
- [x] Image-based search and retrieval
- [x] Adding support for audio modalities
- [ ] Adding support for video modalities
- [ ] Improving the re-ranking system for better contextual relevance
- [ ] Enhanced PDF parsing with semantic section segmentation

---

## 🏗️ Architecture Overview

1. **Data Indexing**:
@@ -56,13 +68,14 @@ Experience the project in action:
2. **Query Processing**:

- Text queries / image-based queries are converted into embeddings for semantic search
- Uploaded images and PDFs are processed and embedded for comparison
- The system performs a nearest neighbor search in the vector database to retrieve relevant text and images
   - Uploaded images, audio files, and PDFs are processed and embedded for comparison
   - The system performs a nearest neighbor search in the vector database to retrieve relevant text, images, and audio (see the sketch after this list)

3. **Response Generation**:

- For text results: Optionally refined or augmented using a language model
- For image results: Directly returned or enhanced with image captions
- For audio results: Returned with relevant metadata and transcriptions where applicable
- For PDFs: Extracts text content and provides relevant sections

4. **Image Annotation**:
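A minimal sketch of the nearest-neighbor lookup in step 2, assuming a FAISS flat L2 index over 384-dimensional all-MiniLM-L6-v2 embeddings; the corpus vectors and `k` here are illustrative stand-ins, not the repo's actual data:

```python
# Step 2 in miniature: embed a query, then nearest-neighbor search in FAISS.
import faiss
import numpy as np

dim = 384  # all-MiniLM-L6-v2 embedding size
index = faiss.IndexFlatL2(dim)  # exact L2 search, no training needed
index.add(np.random.rand(100, dim).astype("float32"))  # stand-in corpus embeddings

query = np.random.rand(1, dim).astype("float32")  # stand-in for an embedded user query
distances, indices = index.search(query, 3)  # top-3 nearest neighbors
print(indices[0])  # row ids that map back to metadata (e.g. text_data.csv)
```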
@@ -108,6 +121,7 @@ Experience the project in action:
- Access the interface in your browser to:
- Submit natural language queries
- Upload images or PDFs to retrieve contextually relevant results
- Upload or record audio files
- Add images using URLs
- Scrape and index website content
- Search using uploaded images
@@ -132,16 +146,6 @@ Experience the project in action:

---

## 🗺️ Roadmap

- [x] Fine-tuning CLIP for domain-specific datasets
- [x] Image-based search and retrieval
- [ ] Adding support for audio and video modalities
- [ ] Improving the re-ranking system for better contextual relevance
- [ ] Enhanced PDF parsing with semantic section segmentation

---

## 🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request for any feature requests or bug fixes.
@@ -157,5 +161,6 @@ This project is licensed under the Apache-2.0 License. See the [LICENSE](LICENSE
## 🙏 Acknowledgments

- [OpenAI CLIP](https://openai.com/research/clip)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [FAISS](https://github.com/facebookresearch/faiss)
- [Hugging Face](https://huggingface.co/)
7 changes: 4 additions & 3 deletions app.py
@@ -12,7 +12,7 @@
from data_search import data_search_page
from data_annotations import data_annotation_page
from model_finetuning import model_finetuning_page
from utils import load_clip_model, load_text_embedding_model
from utils import load_clip_model, load_text_embedding_model, load_whisper_model

os.environ['KMP_DUPLICATE_LIB_OK']='True'

@@ -21,6 +21,7 @@
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = load_clip_model()
text_embedding_model = load_text_embedding_model()
whisper_model = load_whisper_model()
os.makedirs("annotations/", exist_ok=True)
os.makedirs("images/", exist_ok=True)

@@ -34,9 +35,9 @@
)

if page == "Data Upload":
data_upload_page.data_upload(clip_model, preprocess, text_embedding_model)
data_upload_page.data_upload(clip_model, preprocess, text_embedding_model, whisper_model)
if page == "Data Search":
data_search_page.data_search(clip_model, preprocess, text_embedding_model, device)
data_search_page.data_search(clip_model, preprocess, text_embedding_model, whisper_model, device)
if page == "Data Annotation":
data_annotation_page.data_annotations()
if page == "Model Fine-Tuning":
28 changes: 23 additions & 5 deletions data_search/data_search_page.py
@@ -6,13 +6,13 @@
import sys
import torch
from vectordb import search_image_index, search_text_index, search_image_index_with_image, search_text_index_with_image
from utils import load_image_index, load_text_index, get_local_files
from utils import load_image_index, load_text_index, load_audio_index, get_local_files
from data_search import adapter_utils

sys.path.append(os.path.dirname(os.path.abspath(__file__)))


def data_search(clip_model, preprocess, text_embedding_model, device):
def data_search(clip_model, preprocess, text_embedding_model, whisper_model, device):

    @st.cache_resource
    def load_finetuned_model(file_name):
@@ -68,13 +68,17 @@ def load_adapter():
        image_index, image_data = load_image_index()
    if os.path.exists("./vectorstore/text_index.index"):
        text_index, text_data = load_text_index()
    if os.path.exists("./vectorstore/audio_index.index"):
        audio_index, audio_data = load_audio_index()
    with torch.no_grad():
        if not os.path.exists("./vectorstore/image_data.csv"):
            st.warning("No Image Index Found. So not searching for images.")
            image_index = None
        if not os.path.exists("./vectorstore/text_data.csv"):
            st.warning("No Text Index Found. So not searching for text.")
            text_index = None
        if not os.path.exists("./vectorstore/audio_data.csv"):
            st.warning("No Audio Index Found. So not searching for audio.")
            audio_index = None  # mirror the image/text handling; avoids a NameError below
        if image_input:
            image = Image.open(image_input)
            image = preprocess(image).unsqueeze(0).to(device)
@@ -85,14 +89,18 @@ def load_adapter():
            image_indices = search_image_index_with_image(image_features, image_index, clip_model, k=3)
            if text_index is not None:
                text_indices = search_text_index_with_image(adapted_text_embeddings, text_index, text_embedding_model, k=3)
            if audio_index is not None:
                audio_indices = search_text_index_with_image(adapted_text_embeddings, audio_index, text_embedding_model, k=3)
        else:
            if image_index is not None:
                image_indices = search_image_index(text_input, image_index, clip_model, k=3)
            if text_index is not None:
                text_indices = search_text_index(text_input, text_index, text_embedding_model, k=3)
            if audio_index is not None:
                audio_indices = search_text_index(text_input, audio_index, text_embedding_model, k=3)
        if not image_index and not text_index:
        if not image_index and not text_index and not audio_index:
            st.error("No Data Found! Please add data to the database.")
        st.subheader("Top 3 Results")
        st.subheader("Image Results")
        cols = st.columns(3)
        for i in range(3):
            with cols[i]:
@@ -106,9 +114,19 @@ def load_adapter():
                    cosine_similarity = torch.cosine_similarity(image_features, text_features)
                    st.write(f"Similarity: {cosine_similarity.item() * 100:.2f}%")
                    st.image(image_path)
        st.subheader("Text Results")
        cols = st.columns(3)
        for i in range(3):
            with cols[i]:
                if text_index:
                    text_content = text_data['content'].iloc[text_indices[0][i]]
                    st.write(text_content)
        st.subheader("Audio Results")
        cols = st.columns(3)
        for i in range(3):
            with cols[i]:
                if audio_index:
                    audio_path = audio_data['path'].iloc[audio_indices[0][i]]
                    audio_content = audio_data['content'].iloc[audio_indices[0][i]]
                    st.audio(audio_path)
                    st.write(f"_{audio_content}_")
8 changes: 5 additions & 3 deletions data_upload/data_upload_page.py
@@ -2,15 +2,15 @@
import streamlit as st
import sys

from data_upload.input_sources_utils import image_util, pdf_util, website_util
from data_upload.input_sources_utils import image_util, pdf_util, website_util, audio_util

sys.path.append(os.path.dirname(os.path.abspath(__file__)))


def data_upload(clip_model, preprocess, text_embedding_model):
def data_upload(clip_model, preprocess, text_embedding_model, whisper_model):
st.title("Data Upload")
st.warning("Please note that this is a public application. Make sure you are not uploading any sensitive data.")
upload_choice = st.selectbox(options=["Upload Image", "Add Image from URL / Link", "Upload PDF", "Website Link"], label="Select Upload Type")
upload_choice = st.selectbox(options=["Upload Image", "Add Image from URL / Link", "Upload PDF", "Website Link", "Audio Recording"], label="Select Upload Type")
if upload_choice == "Upload Image":
image_util.upload_image(clip_model, preprocess)
elif upload_choice == "Add Image from URL / Link":
@@ -19,3 +19,5 @@ def data_upload(clip_model, preprocess, text_embedding_model):
        pdf_util.upload_pdf(clip_model, preprocess, text_embedding_model)
    elif upload_choice == "Website Link":
        website_util.data_from_website(clip_model, preprocess, text_embedding_model)
    elif upload_choice == "Audio Recording":
        audio_util.upload_audio(whisper_model, text_embedding_model)
29 changes: 29 additions & 0 deletions data_upload/input_sources_utils/audio_util.py
@@ -0,0 +1,29 @@
import os
import requests
import streamlit as st
import sys
import whisper

from vectordb import add_image_to_index, add_pdf_to_index, add_audio_to_index

sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def upload_audio(whisper_model, text_embedding_model):
    st.title("Upload Audio")
    recorded_audio = st.audio_input("Record Audio")
    st.write("---")
    uploaded_audios = st.file_uploader("Upload Audio", type=["mp3", "wav"], accept_multiple_files=True)
    if recorded_audio:
        st.audio(recorded_audio)
        if st.button("Add Audio", key="add_recorded_audio"):  # distinct keys keep the two buttons from colliding
            add_audio_to_index(recorded_audio, whisper_model, text_embedding_model)
            st.success("Audio Added to Database")
    if uploaded_audios:
        for audio in uploaded_audios:
            st.audio(audio)
        if st.button("Add Audio", key="add_uploaded_audios"):
            progress_bar = st.progress(0, f"Adding Audio... | 0/{len(uploaded_audios)}")
            for count, audio in enumerate(uploaded_audios):
                add_audio_to_index(audio, whisper_model, text_embedding_model)
                progress_bar.progress((count + 1) / len(uploaded_audios), f"Adding Audio... | {count + 1}/{len(uploaded_audios)}")
            st.success("Audio Added to Database")
6 changes: 6 additions & 0 deletions requirements.txt
@@ -26,6 +26,7 @@ fonttools==4.55.3
frozenlist==1.5.0
fsspec==2024.9.0
ftfy==6.3.1
future==1.0.0
gitdb==4.0.11
GitPython==3.1.43
greenlet==3.1.1
@@ -48,20 +49,24 @@ langchain-core==0.3.28
langchain-experimental==0.3.4
langchain-text-splitters==0.3.4
langsmith==0.1.147
llvmlite==0.43.0
lxml==5.1.0
markdown-it-py==3.0.0
MarkupSafe==3.0.2
marshmallow==3.23.2
matplotlib==3.10.0
mdurl==0.1.2
more-itertools==10.5.0
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
mypy-extensions==1.0.0
narwhals==1.19.1
networkx==3.4.2
numba==0.60.0
numpy==1.26.4
open_clip_torch==2.29.0
openai-whisper==20240930
orjson==3.10.12
packaging==24.2
pandas==2.2.3
@@ -100,6 +105,7 @@ streamlit-option-menu==0.4.0
sympy==1.13.1
tenacity==8.5.0
threadpoolctl==3.5.0
tiktoken==0.8.0
timm==1.0.12
tokenizers==0.21.0
toml==0.10.2
11 changes: 11 additions & 0 deletions utils.py
@@ -6,6 +6,7 @@
from sentence_transformers import SentenceTransformer
import streamlit as st
import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"

@@ -19,6 +20,11 @@ def load_text_embedding_model():
model = SentenceTransformer("all-MiniLM-L6-v2")
return model

@st.cache_resource
def load_whisper_model():
model = whisper.load_model("small")
return model

def load_image_index():
index = faiss.read_index('./vectorstore/image_index.index')
data = pd.read_csv("./vectorstore/image_data.csv")
@@ -29,6 +35,11 @@ def load_text_index():
data = pd.read_csv("./vectorstore/text_data.csv")
return index, data

def load_audio_index():
index = faiss.read_index('./vectorstore/audio_index.index')
data = pd.read_csv("./vectorstore/audio_data.csv")
return index, data

def cosine_similarity(a, b):
return torch.cosine_similarity(a, b)
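A quick way to exercise the new helpers, assuming an audio file has already been indexed; the audio path is illustrative, and st.cache_resource should simply fall back to an in-memory cache when no Streamlit runtime is present:

```python
from utils import load_whisper_model, load_audio_index

model = load_whisper_model()  # whisper "small" checkpoint
result = model.transcribe("audio/sample.wav")  # illustrative path to a stored clip
print(result["text"])  # the transcript that gets embedded and indexed

index, data = load_audio_index()  # FAISS index plus its metadata CSV
print(index.ntotal, len(data))  # vector count should match the CSV row count
```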

(1 more changed file not rendered in this view.)