Vector search using embeddings, FAISS and Product Quantization with custom index & KMeans

bariscamli/Vector-Search-with-FAISS

Vector Search using Embeddings, FAISS and Product Quantization

Overview

This project implements an efficient similarity search system for lecture content using embeddings, FAISS and Product Quantization with custom index & KMeans implementations. It allows you to find similar lectures based on textual content, enabling quick retrieval and recommendation of lectures.

Features

  • Data Preprocessing: Load and preprocess lecture and query data (the sample data was generated by ChatGPT).
  • Embeddings: Compute and normalize embeddings using a specified model.
  • FAISS Indexing: Build and evaluate a FAISS index for efficient similarity search.
  • Performance Evaluation: Compute recall and queries per second (QPS) metrics.
  • Quantization: Implement Product Quantization (PQ) with a custom index to reduce storage requirements.
  • Visualization: Plot performance metrics for analysis.
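
As a rough illustration of the embedding and baseline features, the sketch below L2-normalizes vectors so that a dot product equals cosine similarity, then finds each query's nearest lecture by brute force. The random arrays stand in for real embeddings, and the function names (`normalize`, `exact_top1`) are illustrative assumptions, not names taken from this repository.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)

def exact_top1(queries: np.ndarray, lectures: np.ndarray) -> np.ndarray:
    """Brute-force nearest lecture per query using dot-product similarity."""
    similarity = queries @ lectures.T   # shape: (n_queries, n_lectures)
    return similarity.argmax(axis=1)

# Random stand-ins for real lecture and query embeddings.
rng = np.random.default_rng(0)
lectures = normalize(rng.standard_normal((100, 32)).astype("float32"))
queries = normalize(rng.standard_normal((5, 32)).astype("float32"))
baseline = exact_top1(queries, lectures)   # one lecture index per query
```

An exact result like `baseline` is what an approximate index can be scored against when computing recall.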

Installation

  1. Clone the Repository

    git clone https://github.com/bariscamli/Vector-Search-with-FAISS.git
    cd Vector-Search-with-FAISS
  2. Create a Virtual Environment (Optional but Recommended)

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  3. Install Dependencies

    pip install -r requirements.txt

Data Preparation

  • Lecture Data: Place your lecture texts in a file specified by LECTURE_FILE in config.py. Each line should contain one lecture.

  • Query Data: Place your query texts in a file specified by QUERY_FILE in config.py. Each line should contain one query.

Example format for lectures.txt:

    Introduction to Machine Learning
    Advanced Topics in Deep Learning
    Statistical Methods in Data Science
    ...
    

Example format for queries.txt:

    Basics of Neural Networks
    Regression Analysis Techniques
    Clustering Algorithms Overview
    ...
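
A one-item-per-line file like the ones above could be read with a small helper such as the following. This is a minimal sketch: `load_lines` is a hypothetical name, and skipping blank lines is an assumption, so the repository's own loader may behave differently.

```python
import tempfile
from pathlib import Path
from typing import List

def load_lines(path: str) -> List[str]:
    """Return the non-empty lines of a text file, stripped of surrounding whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

# Demonstration with a throwaway file instead of the real LECTURE_FILE.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "lectures.txt"
    sample.write_text(
        "Introduction to Machine Learning\n\nAdvanced Topics in Deep Learning\n",
        encoding="utf-8",
    )
    lectures = load_lines(str(sample))
```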
    

Configuration

All configurations are managed through the config.py file. Key parameters include:

  • File Paths
    • LECTURE_FILE: Path to the lecture data file.
    • QUERY_FILE: Path to the query data file.
  • Embedding Model
    • EMBEDDING_MODEL_NAME: Name or path of the embedding model to use.
    • BATCH_SIZE: Batch size for computing embeddings.
  • FAISS Parameters
    • FAISS_EFSEARCH_VALUES: List of efSearch values to sweep during performance evaluation. Larger values typically raise recall at the cost of QPS.
  • Quantization Parameters
    • PQ_M: Number of sub-vectors each embedding is split into (one quantizer per sub-vector).
    • PQ_NBITS: Number of bits per sub-vector code; each sub-quantizer has 2^PQ_NBITS centroids.
    • KMEANS_MAX_ITER: Maximum iterations for k-means during PQ training.
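
To make the roles of PQ_M, PQ_NBITS and KMEANS_MAX_ITER concrete, here is a minimal NumPy sketch of Product Quantization training and encoding. It is not the repository's CustomIndexPQ. The parameter values, the function names (`kmeans`, `train_pq`, `encode_pq`) and the random data are all assumptions chosen for illustration, and the sketch assumes the embedding dimension is divisible by PQ_M and that PQ_NBITS is at most 8.

```python
import numpy as np

# Hypothetical values; the real ones live in config.py.
PQ_M = 4              # number of sub-vectors per embedding
PQ_NBITS = 4          # bits per code, so 2**4 = 16 centroids per sub-quantizer
KMEANS_MAX_ITER = 10  # cap on k-means iterations

def kmeans(data, k, max_iter, rng):
    """Plain Lloyd's k-means; returns a (k, dim) array of centroids."""
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Squared distance from every point to every centroid.
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:   # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def train_pq(vectors, m, nbits, max_iter, seed=0):
    """Learn one codebook per sub-vector; result shape is (m, 2**nbits, dim // m)."""
    rng = np.random.default_rng(seed)
    sub_dim = vectors.shape[1] // m
    return np.stack([
        kmeans(vectors[:, i * sub_dim:(i + 1) * sub_dim], 2 ** nbits, max_iter, rng)
        for i in range(m)
    ])

def encode_pq(vectors, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    m, _, sub_dim = codebooks.shape
    codes = np.empty((len(vectors), m), dtype=np.uint8)  # uint8 holds codes for nbits <= 8
    for i in range(m):
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        dists = ((sub[:, None, :] - codebooks[i][None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = dists.argmin(axis=1)
    return codes

rng = np.random.default_rng(0)
vectors = rng.standard_normal((200, 32)).astype("float32")  # stand-in embeddings
codebooks = train_pq(vectors, PQ_M, PQ_NBITS, KMEANS_MAX_ITER)
codes = encode_pq(vectors, codebooks)

raw_bytes = vectors.shape[1] * 4   # 32 float32 values = 128 bytes per vector
pq_bytes = PQ_M * PQ_NBITS / 8     # 4 codes of 4 bits = 2 bytes per vector if bit-packed
```

With these illustrative settings, each 128-byte vector shrinks to 2 bytes of codes once they are bit-packed, plus the shared codebooks. This sketch stores each code in a full uint8 for simplicity.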

Usage

Run the main script to execute the full pipeline:

    python main.py

What Happens When You Run main.py

  1. Data Loading and Preprocessing

    • Lectures and queries are loaded from the specified files.
    • Text data is preprocessed (e.g., tokenization, cleaning).
  2. Embedding Computation

    • An embedding model is loaded as per EMBEDDING_MODEL_NAME.
    • Embeddings for lectures and queries are computed and normalized.
  3. Baseline Computation

    • An exact baseline similarity matrix is computed using dot products between the query and lecture embeddings.
    • The baseline results serve as the reference when measuring the recall of the approximate indexes.
  4. FAISS Index Building and Evaluation

    • A FAISS index is built for the lecture embeddings.
    • The index is evaluated over different efSearch values.
    • Performance metrics (recall@1 and QPS) are computed.
  5. Performance Visualization

    • A plot is generated showing the trade-off between recall and QPS.
    • The plot is displayed using Matplotlib.
  6. Quantization

    • A custom PQ index (CustomIndexPQ) is created.
    • The index is trained and lectures are added to it.
  7. Example Search

    • An example search is performed using the PQ index.
    • Results are logged, showing the lectures most similar to a given lecture.

Dependencies

  • Python 3.7 or higher
  • Required Python packages (installed via requirements.txt):
    • numpy
    • matplotlib
    • faiss (install via pip install faiss-cpu, or faiss-gpu if you have a GPU)
    • logging (part of the Python standard library; no separate installation needed)
    • Embedding model libraries (e.g., transformers if using Hugging Face models)

License

This project is licensed under the MIT License. See the LICENSE file for details.
