This project implements an efficient similarity search system for lecture content using embeddings, FAISS, and Product Quantization (PQ), with custom index and KMeans implementations. It lets you find similar lectures based on textual content, enabling quick retrieval and recommendation of lectures.
- Data Preprocessing: Load and preprocess lecture and query data (Generated by ChatGPT).
- Embeddings: Compute and normalize embeddings using a specified model.
- FAISS Indexing: Build and evaluate a FAISS index for efficient similarity search.
- Performance Evaluation: Compute recall and queries per second (QPS) metrics.
- Quantization: Implement Product Quantization (PQ) with custom index to reduce storage requirements.
- Visualization: Plot performance metrics for analysis.
Clone the Repository

```shell
git clone https://github.com/bariscamli/Vector-Search-with-FAISS.git
cd Vector-Search-with-FAISS
```
Create a Virtual Environment (Optional but Recommended)

```shell
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
```
Install Dependencies

```shell
pip install -r requirements.txt
```
-
- Lecture Data: Place your lecture texts in the file specified by LECTURE_FILE in config.py. Each line should contain one lecture.
- Query Data: Place your query texts in the file specified by QUERY_FILE in config.py. Each line should contain one query.

Example format for lectures.txt:

```
Introduction to Machine Learning
Advanced Topics in Deep Learning
Statistical Methods in Data Science
...
```

Example format for queries.txt:

```
Basics of Neural Networks
Regression Analysis Techniques
Clustering Algorithms Overview
...
```
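Since each line holds one item, loading the data can be as simple as reading the files line by line. A minimal sketch (the helper name `load_lines` is illustrative, not the project's actual API):

```python
from pathlib import Path

# Write small sample files in the format described above (illustrative only).
Path("lectures.txt").write_text(
    "Introduction to Machine Learning\nAdvanced Topics in Deep Learning\n",
    encoding="utf-8",
)
Path("queries.txt").write_text("Basics of Neural Networks\n", encoding="utf-8")

def load_lines(path):
    """Read one text item per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

lectures = load_lines("lectures.txt")
queries = load_lines("queries.txt")
```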
All configuration is managed through the config.py file. Key parameters include:
File Paths
- LECTURE_FILE: Path to the lecture data file.
- QUERY_FILE: Path to the query data file.
Embedding Model
- EMBEDDING_MODEL_NAME: Name or path of the embedding model to use.
- BATCH_SIZE: Batch size for computing embeddings.
FAISS Parameters
- FAISS_EFSEARCH_VALUES: List of efSearch values for performance evaluation.
Quantization Parameters
- PQ_M: Number of sub-quantizers (each vector is split into PQ_M sub-vectors).
- PQ_NBITS: Number of bits per sub-vector code; each sub-quantizer learns 2^PQ_NBITS centroids.
- KMEANS_MAX_ITER: Maximum number of k-means iterations during PQ training.
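Put together, a config.py with these parameters might look like the sketch below; the values are placeholders, not the repository's defaults:

```python
# Hypothetical config.py -- all values are illustrative placeholders.

# File paths
LECTURE_FILE = "lectures.txt"
QUERY_FILE = "queries.txt"

# Embedding model
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # example model name only
BATCH_SIZE = 32

# FAISS parameters
FAISS_EFSEARCH_VALUES = [16, 32, 64, 128]

# Quantization parameters
PQ_M = 8            # number of sub-quantizers
PQ_NBITS = 8        # bits per code -> 2**8 = 256 centroids per sub-quantizer
KMEANS_MAX_ITER = 20
```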
Run the main script to execute the full pipeline:

```shell
python main.py
```
What Happens When You Run main.py
Data Loading and Preprocessing
- Lectures and queries are loaded from the specified files.
- Text data is preprocessed (e.g., tokenization, cleaning).
Embedding Computation
- An embedding model is loaded as specified by EMBEDDING_MODEL_NAME.
- Embeddings for lectures and queries are computed and normalized.
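Normalizing presumably means scaling each embedding to unit L2 norm, so that a plain dot product equals cosine similarity. A numpy sketch of that step (not the project's code):

```python
import numpy as np

def normalize_rows(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm so dot product == cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)  # guard against zero vectors

emb = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = normalize_rows(emb)  # every row now has norm 1.0
```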
Baseline Computation
- A baseline similarity matrix is computed using dot products.
- The baseline serves as the exact-search ground truth for performance comparison.
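With unit-normalized embeddings, the exact baseline reduces to a single matrix multiply. A sketch under that assumption, using random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
lectures = rng.normal(size=(100, 16))
queries = rng.normal(size=(5, 16))

# Normalize rows so the dot product equals cosine similarity.
lectures /= np.linalg.norm(lectures, axis=1, keepdims=True)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

sim = queries @ lectures.T          # (5, 100) similarity matrix
ground_truth = sim.argmax(axis=1)   # exact nearest lecture per query
```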
FAISS Index Building and Evaluation
- A FAISS index is built over the lecture embeddings.
- The index is evaluated over the different efSearch values.
- Performance metrics (recall@1 and QPS) are computed.
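Here recall@1 is the fraction of queries whose approximate top-1 result matches the exact baseline, and QPS is queries divided by wall-clock search time. A hedged sketch of how such metrics might be computed (not the repository's evaluation code; brute-force search stands in for a FAISS index):

```python
import time
import numpy as np

def recall_at_1(approx_ids, exact_ids):
    """Fraction of queries whose approximate top-1 matches the exact top-1."""
    return float(np.mean(np.asarray(approx_ids) == np.asarray(exact_ids)))

def measure_qps(search_fn, queries):
    """Time a search function over all queries; report queries per second."""
    start = time.perf_counter()
    results = [search_fn(q) for q in queries]
    elapsed = time.perf_counter() - start
    return results, len(queries) / max(elapsed, 1e-9)

# Toy usage with random data and an exact (brute-force) "index".
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 32))
qs = rng.normal(size=(50, 32))
exact = (qs @ db.T).argmax(axis=1)
approx, qps = measure_qps(lambda q: int((db @ q).argmax()), qs)
```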
Performance Visualization
- Recall@1 and QPS are plotted across the efSearch values for analysis.
Quantization
- A custom PQ index (
CustomIndexPQ
) is created. - The index is trained and lectures are added to it.
- A custom PQ index (
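Product Quantization splits each vector into PQ_M sub-vectors and stores, for each, only the index of its nearest centroid in a small per-subspace codebook learned with k-means. A compact numpy sketch of the idea (illustrative only, not the CustomIndexPQ implementation):

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Tiny k-means: returns (k, dim) centroids for one sub-space."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid, then recompute means.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids

def pq_train(x, m, nbits, iters=20):
    """Learn one codebook of 2**nbits centroids per sub-space."""
    sub_dim = x.shape[1] // m
    return [kmeans(x[:, i * sub_dim:(i + 1) * sub_dim], 2 ** nbits, iters)
            for i in range(m)]

def pq_encode(x, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    m = len(codebooks)
    sub_dim = x.shape[1] // m
    codes = np.empty((len(x), m), dtype=np.uint8)
    for i, cb in enumerate(codebooks):
        sub = x[:, i * sub_dim:(i + 1) * sub_dim]
        dists = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes[:, i] = dists.argmin(axis=1)
    return codes

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 16)).astype(np.float32)
books = pq_train(vectors, m=4, nbits=4)   # 4 sub-spaces, 16 centroids each
codes = pq_encode(vectors, books)         # (200, 4) uint8 codes
# Storage drops from 16 float32s (64 bytes) to 4 bytes per vector.
```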
Example Search
- A sample search is run against the trained PQ index to retrieve the most similar lectures.
Requirements
- Python 3.7 or higher
- Required Python packages (installed via requirements.txt):
  - numpy
  - matplotlib
  - faiss (install via pip install faiss-cpu, or faiss-gpu if you have a GPU)
  - logging (part of the Python standard library)
- Embedding model libraries (e.g., transformers if using Hugging Face models)