This project implements an efficient similarity search system for lecture content using embeddings, FAISS, and Product Quantization (PQ), with custom index and KMeans implementations. It lets you find similar lectures based on textual content, enabling quick retrieval and recommendation of lectures.
- Data Preprocessing: Load and preprocess lecture and query data (Generated by ChatGPT).
- Embeddings: Compute and normalize embeddings using a specified model.
- FAISS Indexing: Build and evaluate a FAISS index for efficient similarity search.
- Performance Evaluation: Compute recall and queries per second (QPS) metrics.
- Quantization: Implement Product Quantization (PQ) with custom index to reduce storage requirements.
- Visualization: Plot performance metrics for analysis.
Clone the Repository

```shell
git clone https://github.com/bariscamli/Vector-Search-with-FAISS.git
cd Vector-Search-with-FAISS
```
Create a Virtual Environment (Optional but Recommended)

```shell
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
Install Dependencies

```shell
pip install -r requirements.txt
```
-
- Lecture Data: Place your lecture texts in the file specified by LECTURE_FILE in config.py. Each line should contain one lecture.
- Query Data: Place your query texts in the file specified by QUERY_FILE in config.py. Each line should contain one query.

Example format for lectures.txt:

```
Introduction to Machine Learning
Advanced Topics in Deep Learning
Statistical Methods in Data Science
...
```

Example format for queries.txt:

```
Basics of Neural Networks
Regression Analysis Techniques
Clustering Algorithms Overview
...
```
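Since each line holds one item, loading the data can be as simple as reading the files line by line. A minimal sketch (the helper name `load_lines` is illustrative, not the project's actual API):

```python
from pathlib import Path

# Write small sample files in the format described above (illustrative only).
Path("lectures.txt").write_text(
    "Introduction to Machine Learning\nAdvanced Topics in Deep Learning\n",
    encoding="utf-8",
)
Path("queries.txt").write_text("Basics of Neural Networks\n", encoding="utf-8")

def load_lines(path):
    """Read one text item per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

lectures = load_lines("lectures.txt")
queries = load_lines("queries.txt")
```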
All configuration is managed through the config.py file. Key parameters include:
File Paths
- LECTURE_FILE: Path to the lecture data file.
- QUERY_FILE: Path to the query data file.
Embedding Model
- EMBEDDING_MODEL_NAME: Name or path of the embedding model to use.
- BATCH_SIZE: Batch size for computing embeddings.
FAISS Parameters
- FAISS_EFSEARCH_VALUES: List of efSearch values for performance evaluation.
Quantization Parameters
- PQ_M: Number of sub-quantizers (each vector is split into PQ_M sub-vectors).
- PQ_NBITS: Number of bits per sub-vector code; each sub-quantizer learns 2^PQ_NBITS centroids.
- KMEANS_MAX_ITER: Maximum number of k-means iterations during PQ training.
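Put together, a config.py with these parameters might look like the sketch below; the values are placeholders, not the repository's defaults:

```python
# Hypothetical config.py -- all values are illustrative placeholders.

# File paths
LECTURE_FILE = "lectures.txt"
QUERY_FILE = "queries.txt"

# Embedding model
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # example model name only
BATCH_SIZE = 32

# FAISS parameters
FAISS_EFSEARCH_VALUES = [16, 32, 64, 128]

# Quantization parameters
PQ_M = 8            # number of sub-quantizers
PQ_NBITS = 8        # bits per code -> 2**8 = 256 centroids per sub-quantizer
KMEANS_MAX_ITER = 20
```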
Run the main script to execute the full pipeline:

```shell
python main.py
```
What Happens When You Run main.py
Data Loading and Preprocessing
- Lectures and queries are loaded from the specified files.
- Text data is preprocessed (e.g., tokenization, cleaning).
Embedding Computation
- An embedding model is loaded as specified by EMBEDDING_MODEL_NAME.
- Embeddings for lectures and queries are computed and normalized.
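Normalizing presumably means scaling each embedding to unit L2 norm, so that a plain dot product equals cosine similarity. A numpy sketch of that step (not the project's code):

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm so dot product == cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors

emb = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = normalize_rows(emb)  # every row now has norm 1.0
```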
Baseline Computation
- A baseline similarity matrix is computed using dot products.
- The baseline serves as the exact-search ground truth for performance comparison.
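With unit-normalized embeddings, the exact baseline reduces to a single matrix multiply. A sketch under that assumption, using random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
lectures = rng.normal(size=(100, 16))
queries = rng.normal(size=(5, 16))

# Normalize rows so the dot product equals cosine similarity.
lectures /= np.linalg.norm(lectures, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

sim = queries @ lectures.T          # (5, 100) similarity matrix
ground_truth = sim.argmax(axis=1)   # exact nearest lecture per query
```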
FAISS Index Building and Evaluation
- A FAISS index is built over the lecture embeddings.
- The index is evaluated over the different efSearch values.
- Performance metrics (recall@1 and QPS) are computed.
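Here recall@1 is the fraction of queries whose approximate top-1 result matches the exact baseline, and QPS is queries divided by wall-clock search time. A hedged sketch of how such metrics might be computed (not the repository's evaluation code; brute-force search stands in for a FAISS index):

```python
import time
import numpy as np

def recall_at_1(approx_ids, exact_ids):
    """Fraction of queries whose approximate top-1 matches the exact top-1."""
    return float(np.mean(np.asarray(approx_ids) == np.asarray(exact_ids)))

def measure_qps(search_fn, queries):
    """Time a search function over all queries; report queries per second."""
    start = time.perf_counter()
    results = [search_fn(q) for q in queries]
    elapsed = time.perf_counter() - start
    return results, len(queries) / max(elapsed, 1e-9)

# Toy usage with random data and an exact (brute-force) "index".
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 32))
qs = rng.normal(size=(50, 32))
exact = (qs @ db.T).argmax(axis=1)
approx, qps = measure_qps(lambda q: int((db @ q).argmax()), qs)
```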
Performance Visualization
- Recall@1 and QPS are plotted across the efSearch values for analysis.
Quantization
- A custom PQ index (
CustomIndexPQ
) is created. - The index is trained and lectures are added to it.
- A custom PQ index (
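Product Quantization splits each vector into PQ_M sub-vectors and stores, for each, only the index of its nearest centroid in a small per-subspace codebook learned with k-means. A compact numpy sketch of the idea (illustrative only, not the CustomIndexPQ implementation):

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Tiny k-means: returns (k, dim) centroids for one sub-space."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids

def pq_train(x, m, nbits, iters=20):
    """Learn one codebook of 2**nbits centroids per sub-space."""
    sub_dim = x.shape[1] // m
    return [kmeans(x[:, i * sub_dim:(i + 1) * sub_dim], 2 ** nbits, iters)
            for i in range(m)]

def pq_encode(x, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    m = len(codebooks)
    sub_dim = x.shape[1] // m
    codes = np.empty((len(x), m), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = x[:, i * sub_dim:(i + 1) * sub_dim]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes[:, i] = dists.argmin(axis=1)
    return codes

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16)).astype(np.float32)
books = pq_train(vectors, m=4, nbits=4)   # 4 sub-spaces, 16 centroids each
codes = pq_encode(vectors, books)         # (200, 4) uint8 codes
# Storage drops from 16 float32s (64 bytes) to 4 bytes per vector.
```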
Example Search
- A sample search is run against the trained PQ index to retrieve the most similar lectures.
Requirements
- Python 3.7 or higher
- Required Python packages (installed via requirements.txt):
  - numpy
  - matplotlib
  - faiss (install via pip install faiss-cpu, or faiss-gpu if you have a GPU)
  - logging (part of the Python standard library)
- Embedding model libraries (e.g., transformers if using Hugging Face models)