This is a mirror repo of MemoRAG. Please refer to the original repo for the latest updates.
MemoRAG is an innovative RAG framework built on top of a highly efficient, super-long memory model. Unlike standard RAG, which primarily handles queries with explicit information needs, MemoRAG leverages its memory model to achieve a global understanding of the entire database. By recalling query-specific clues from memory, MemoRAG enhances evidence retrieval, resulting in more accurate and contextually rich response generation.
- Global Memory: Handles up to 1 million tokens in a single context, providing comprehensive understanding across massive datasets.
- Optimizable & Flexible: Adapts to new tasks with ease, achieving optimized performance with just a few hours of additional training.
- Contextual Clues: Generates precise clues from global memory, bridging raw input to answers and unlocking hidden insights from complex data.
- Efficient Caching: Speeds up context pre-filling by up to 30x, with support for caching chunking, indexing, and encoding.
- Context Reuse: Encodes long contexts once and supports repeated usage, boosting efficiency in tasks that require recurring data access.
MemoRAG is currently under active development, with resources and prototypes continuously being published at this repository.
- Initial Codes Release
- Memory Model Release
- Support OpenAI/Azure models
- Add evaluation scripts on benchmarks
- Dataset Release
- Technical Report Release
- Demo Codes Release
- Training Codes for Memory model Release
- Support Any Model as Memory model
[10/09/24] We release MemoRAG's Technical Report.
[09/09/24] You can try MemoRAG on Google Colab for free.
[05/09/24] A Qwen2-based memory model is available at TommyChien/memorag-qwen2-7b-inst.
[03/09/24] A Mistral-based memory model is available at TommyChien/memorag-mistral-7b-inst.
[01/09/24] The project launched!
🆓 You can directly try MemoRAG on Google Colab for free.
In this notebook, we run the complete MemoRAG pipeline (Memory Model + Retriever + Generation Model) on a single T4 GPU with 15GiB of memory provided by Google Colab. Despite the limited resources, MemoRAG can process half of the content from the example book (~68K tokens) and perform all of its functions.
To use Memorizer and MemoRAG, you need to have Python installed along with the required libraries. You can install the necessary dependencies using the following command:
Install Dependencies
pip install torch==2.3.1
conda install -c pytorch -c nvidia faiss-gpu=1.8.0
Install from source
# clone this repo first
cd MemoRAG
pip install -e .
Install via pip
pip install memorag
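After installing, a quick import check can confirm that the packages are visible to the current interpreter. The snippet below is a minimal sketch, not part of MemoRAG: the helper name check_dependencies is made up, and it assumes that faiss-gpu is imported as faiss.

```python
import importlib.util

# Package names from the install commands above, mapped to their import names.
# Assumption: the faiss-gpu package is imported as "faiss".
REQUIRED_MODULES = {"torch": "torch", "faiss-gpu": "faiss", "memorag": "memorag"}


def check_dependencies(modules=REQUIRED_MODULES):
    """Return {package_name: bool}, True when the module can be located."""
    return {
        package: importlib.util.find_spec(module) is not None
        for package, module in modules.items()
    }


if __name__ == "__main__":
    for package, found in check_dependencies().items():
        print(f"{package}: {'found' if found else 'MISSING'}")
```

The check only locates the modules and does not import them, so it will not catch a broken CUDA setup.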
For a quick start, we provide a notebook illustrating all functions of MemoRAG here.
MemoRAG is easy to use and can be initialized with HuggingFace models directly. By using the MemoRAG.memorize() method, the memory model builds a global memory over a long input context. Empirically, with default parameter settings, TommyChien/memorag-qwen2-7b-inst can handle contexts of up to 400K tokens, while TommyChien/memorag-mistral-7b-inst can manage contexts up to 128K tokens. By increasing the beacon_ratio parameter, the model’s capacity to handle longer contexts can be extended. For example, TommyChien/memorag-qwen2-7b-inst can process up to one million tokens with beacon_ratio=16.
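The empirical limits quoted above can be turned into a small lookup when choosing a configuration for a given context length. This is only a sketch: suggest_memory_config is a hypothetical helper, the limits are the approximate figures from this README (with "K" read as thousands), and a beacon_ratio of None simply means "leave the default untouched".

```python
# Empirical limits quoted above, in tokens. None means default parameter settings.
EMPIRICAL_LIMITS = [
    # (memory model, beacon_ratio, approximate max context tokens)
    ("TommyChien/memorag-mistral-7b-inst", None, 128_000),
    ("TommyChien/memorag-qwen2-7b-inst", None, 400_000),
    ("TommyChien/memorag-qwen2-7b-inst", 16, 1_000_000),
]


def suggest_memory_config(num_tokens):
    """Hypothetical helper: return the first listed config whose limit covers num_tokens."""
    for model_name, beacon_ratio, max_tokens in EMPIRICAL_LIMITS:
        if num_tokens <= max_tokens:
            return {"mem_model_name_or_path": model_name, "beacon_ratio": beacon_ratio}
    raise ValueError(f"{num_tokens} tokens exceeds the limits quoted in this README")
```

When the suggested beacon_ratio is None, omit the argument rather than passing None to the constructor.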
from memorag import MemoRAG
# Initialize MemoRAG pipeline
pipe = MemoRAG(
    mem_model_name_or_path="TommyChien/memorag-mistral-7b-inst",
    ret_model_name_or_path="BAAI/bge-m3",
    gen_model_name_or_path="mistralai/Mistral-7B-Instruct-v0.2", # Optional: if not specified, the memory model is used as the generator
    cache_dir="path_to_model_cache", # Optional: specify local model cache directory
    access_token="hugging_face_access_token", # Optional: Hugging Face access token
    beacon_ratio=4
)
context = open("examples/harry_potter.txt").read()
query = "How many times is the Chamber of Secrets opened in the book?"
# Memorize the context and save to cache
pipe.memorize(context, save_dir="cache/harry_potter/", print_stats=True)
# Generate response using the memorized context
res = pipe(context=context, query=query, task_type="memorag", max_new_tokens=256)
print(f"MemoRAG generated answer: \n{res}")
When running the above code, the encoded key-value (KV) cache, Faiss index, and chunked passages are stored in the specified save_dir. Afterward, if the same context is used again, the data can be quickly loaded from the disk:
pipe.load("cache/harry_potter/", print_stats=True)
Typically, loading cached weights is highly efficient. For example, encoding, chunking, and indexing a 200K-token context takes approximately 35 seconds using TommyChien/memorag-qwen2-7b-inst as the memory model, but only 1.5 seconds when loading from cached files.
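The memorize-once, load-afterwards pattern can be wrapped in a small helper. The sketch below is not part of MemoRAG: memorize_or_load is a hypothetical function that relies only on the memorize() and load() calls shown above, and it assumes that a non-empty save_dir means an earlier memorize() call completed for the same context and memory model.

```python
import os


def memorize_or_load(pipe, context, save_dir, print_stats=True):
    """Load a cached memory from save_dir if present, otherwise build and cache it.

    `pipe` is expected to expose the memorize() and load() methods shown above.
    Returns "loaded" or "memorized" so the caller can tell which path ran.
    """
    # Assumption: a non-empty directory means a previous memorize() call finished.
    if os.path.isdir(save_dir) and os.listdir(save_dir):
        pipe.load(save_dir, print_stats=print_stats)
        return "loaded"
    pipe.memorize(context, save_dir=save_dir, print_stats=print_stats)
    return "memorized"
```

The helper does not verify that the cache matches the context or the memory model, so use a separate save_dir per context and model.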
To perform summarization tasks, use the following script:
res = pipe(context=context, task_type="summarize", max_new_tokens=512)
print(f"MemoRAG summary of the full book:\n {res}")
If you want to use APIs as a generator, refer to the script below:
from memorag import Agent, MemoRAG
# API configuration
api_dict = {
    "endpoint": "",
    "api_version": "2024-02-15-preview",
    "api_key": ""
}
model = "gpt-35-turbo-16k"
source = "azure"
# Initialize Agent with the API
agent = Agent(model, source, api_dict)
print(agent.generate("hi!")) # Test the API
# Initialize MemoRAG pipeline with a customized generator model
pipe = MemoRAG(
    mem_model_name_or_path="TommyChien/memorag-qwen2-7b-inst",
    ret_model_name_or_path="BAAI/bge-m3",
    cache_dir="path_to_model_cache", # Optional: specify local model cache directory
    customized_gen_model=agent,
)
# Load previously cached context
pipe.load("cache/harry_potter_qwen/", print_stats=True)
# Use the loaded context for question answering
query = "How are the mutual relationships between the main characters?"
context = open("harry_potter.txt").read()
res = pipe(context=context, query=query, task_type="memorag", max_new_tokens=256)
print(f"MemoRAG with GPT-3.5 generated answer: \n{res}")
The built-in Agent object supports models from both openai and deepseek. Below are the configurations for initializing these models:
# Using deepseek models
model = ""
source = "deepseek"
api_dict = {
    "base_url": "",
    "api_key": ""
}
# Using openai models
model = ""
source = "openai"
api_dict = {
    "api_key": ""
}
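To keep credentials out of source files, the api_dict can be assembled from environment variables. The following is a minimal sketch under stated assumptions: build_api_dict and the MEMORAG_* variable names are made up for illustration, and the per-source keys are taken from the configuration snippets in this README.

```python
import os

# Keys expected for each source, taken from the configuration snippets above.
REQUIRED_KEYS = {
    "azure": ("endpoint", "api_version", "api_key"),
    "deepseek": ("base_url", "api_key"),
    "openai": ("api_key",),
}


def build_api_dict(source, env=os.environ):
    """Hypothetical helper: read MEMORAG_<KEY> variables into an api_dict for `source`."""
    if source not in REQUIRED_KEYS:
        raise ValueError(f"unknown source: {source!r}")
    api_dict = {}
    for key in REQUIRED_KEYS[source]:
        env_name = f"MEMORAG_{key.upper()}"  # e.g. MEMORAG_API_KEY (made-up naming scheme)
        if not env.get(env_name):
            raise KeyError(f"environment variable {env_name} is not set")
        api_dict[key] = env[env_name]
    return api_dict
```

The result can then be passed as the third argument to Agent, as in the examples above.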
The Memory model can be used independently to store, recall, and interact with the context. Here’s an example:
from memorag import Memory
# Initialize the Memory model
memo_model = Memory(
    "TommyChien/memorag-qwen2-7b-inst",
    cache_dir="path_to_model_cache", # Optional: specify local model cache directory
    beacon_ratio=4 # Adjust beacon ratio for handling longer contexts
)
# Load and memorize the context
context = open("harry_potter.txt").read()
memo_model.memorize(context)
# Save the memorized context to disk
memo_model.save("cache/harry_potter/memory.bin")
# Query the model for answers
query = "How are the mutual relationships between the main characters?"
res = memo_model.answer(query)
print("Using memory to answer the query:\n", res)
# Recall text clues for evidence retrieval
res = memo_model.recall(query)
print("Using memory to recall text clues to support evidence retrieval:\n", res)
# Rewrite the query into more specific surrogate queries
res = memo_model.rewrite(query)
print("Using memory to rewrite the input query into more specific surrogate queries:\n", res)
In addition to the standalone Memory Model, MemoRAG provides memory-augmented retrieval functionality. This allows for improved evidence retrieval based on recalled clues from memory.
from memorag import MemoRAG
# Initialize MemoRAG pipeline
pipe = MemoRAG(
    mem_model_name_or_path="TommyChien/memorag-qwen2-7b-inst",
    ret_model_name_or_path="BAAI/bge-m3",
    cache_dir="path_to_model_cache", # Optional: specify local model cache directory
    access_token="hugging_face_access_token" # Optional: Hugging Face access token
)
# Load and memorize the context
test_txt = open("harry_potter.txt").read()
pipe.memorize(test_txt, save_dir="cache/harry_potter/", print_stats=True)
# Define the query
query = "How are the mutual relationships between the main characters?"
# Recall clues from memory
clues = pipe.mem_model.recall(query).split("\n")
clues = [q for q in clues if len(q.split()) > 3] # Filter out short or irrelevant clues
print("Clues generated from memory:\n", clues)
# Retrieve relevant passages based on the recalled clues
retrieved_passages = pipe._retrieve(clues)
print("\n======\n".join(retrieved_passages[:3]))
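The clue-filtering step can be tried in isolation, without loading any model. The sketch below restates the word-count filter from the example above as a standalone function; filter_clues is a hypothetical name, and the sample text is made up and is not real model output.

```python
def filter_clues(recalled_text, min_words=4):
    """Split recalled text into lines and keep those with at least min_words words."""
    clues = []
    for line in recalled_text.split("\n"):
        line = line.strip()
        # Same threshold as the example above: more than 3 words.
        if len(line.split()) >= min_words:
            clues.append(line)
    return clues


# Made-up recall output, only to show the effect of the filter.
sample = "Harry and Ron become close friends\nHermione\n\nDraco is hostile towards Harry"
print(filter_clues(sample))
```

Short fragments and blank lines are dropped, so only clues with enough content to drive retrieval remain.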
Below are experimental results for the memory model, combined with three generation models.
Dataset | NarrativeQA | Qasper | MultifieldQA | Musique | 2Wiki | HotpotQA | MultiNews | GovReport | En.sum | En.qa | Fin | Legal | Mix |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
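Benchmark | LongBench | LongBench | LongBench | LongBench | LongBench | LongBench | LongBench | LongBench | InfBench | InfBench | UltraDomain | UltraDomain | UltraDomain |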
Generator: Llama3-8B-Instruct-8K | |||||||||||||
Full | 21.3 | 43.4 | 46.6 | 23.5 | 38.2 | 47.1 | 24.6 | 23.6 | 13.1 | 6.7 | 34.2 | 33.2 | 42.7 |
BGE-M3 | 22.1 | 44.3 | 50.2 | 22.2 | 36.7 | 48.4 | 22.1 | 20.1 | 12.1 | 15.1 | 41.4 | 40.6 | 46.4 |
Stella-v5 | 12.3 | 35.2 | 44.4 | 22.1 | 33.3 | 41.9 | 22.1 | 20.7 | 11.7 | 14.8 | 41.9 | 33.7 | 44.9 |
RQ-RAG | 20.2 | 43.9 | 49.1 | 22.7 | 36.1 | 44.5 | 20.6 | 21.0 | 12.0 | 13.3 | 39.5 | 36.8 | 44.5 |
HyDE | 22.1 | 44.3 | 50.2 | 22.2 | 36.7 | 48.4 | - | - | - | 19.1 | 41.4 | 40.6 | 46.4 |
MemoRAG | 22.8 | 45.7 | 50.7 | 28.4 | 51.4 | 57.0 | 27.4 | 27.9 | 14.1 | 16.1 | 47.8 | 47.9 | 55.5 |
Generator: Phi-3-mini-128K | |||||||||||||
Full | 21.4 | 35.0 | 47.3 | 19.0 | 35.5 | 42.1 | 25.6 | 23.7 | 13.0 | 15.2 | 44.8 | 40.5 | 44.7 |
BGE-M3 | 20.3 | 33.0 | 44.3 | 21.1 | 35.4 | 42.1 | 17.7 | 19.8 | 9.6 | 16.3 | 41.7 | 41.2 | 43.7 |
Stella-v5 | 13.7 | 32.4 | 43.5 | 21.0 | 35.6 | 40.6 | 20.3 | 18.2 | 10.0 | 19.5 | 42.8 | 35.1 | 43.9 |
RQ-RAG | 19.6 | 34.1 | 46.5 | 21.9 | 36.1 | 41.7 | 20.1 | 18.6 | 10.4 | 16.1 | 41.8 | 40.9 | 43.2 |
HyDE | 18.7 | 36.0 | 47.5 | 20.5 | 36.8 | 42.7 | - | - | - | 19.6 | 43.1 | 41.6 | 44.2 |
MemoRAG | 27.5 | 43.9 | 52.2 | 33.9 | 54.1 | 54.8 | 32.9 | 26.3 | 15.7 | 22.9 | 51.5 | 51.0 | 55.6 |
Generator: Mistral-7B-Instruct-v0.2-32K | |||||||||||||
Full | 20.8 | 29.2 | 46.3 | 18.9 | 20.6 | 37.6 | 23.0 | 20.4 | 12.4 | 12.3 | 36.5 | 35.8 | 42.1 |
BGE-M3 | 17.3 | 29.5 | 46.3 | 18.5 | 20.3 | 36.2 | 24.3 | 26.1 | 13.5 | 12.2 | 40.5 | 42.0 | 41.1 |
Stella-v5 | 13.5 | 23.7 | 42.1 | 18.6 | 22.2 | 31.9 | 21.1 | 18.5 | 13.2 | 9.7 | 40.9 | 34.9 | 42.1 |
RQ-RAG | 17.1 | 29.2 | 47.0 | 19.1 | 21.5 | 37.0 | 22.1 | 18.6 | 13.1 | 12.7 | 44.3 | 44.6 | 43.4 |
HyDE | 17.4 | 29.5 | 46.3 | 18.5 | 20.1 | 36.2 | - | - | - | 12.2 | 42.8 | 35.1 | 43.9 |
MemoRAG | 23.1 | 31.2 | 50.0 | 26.9 | 30.3 | 42.9 | 27.1 | 31.6 | 17.9 | 15.4 | 48.0 | 51.2 | 53.6 |
MemoRAG-qwen2 | 22.2 | 32.7 | 49.6 | 31.4 | 33.7 | 44.4 | 27.0 | 31.5 | 16.8 | 17.6 | 48.7 | 52.3 | 48.6 |
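To read the table as per-dataset gains, two rows can be subtracted directly. The sketch below uses the BGE-M3 and MemoRAG rows from the Llama3-8B-Instruct-8K block as copied from the table; score_gains is a hypothetical helper and not part of the evaluation scripts.

```python
DATASETS = ["NarrativeQA", "Qasper", "MultifieldQA", "Musique", "2Wiki", "HotpotQA",
            "MultiNews", "GovReport", "En.sum", "En.qa", "Fin", "Legal", "Mix"]

# Rows copied from the Llama3-8B-Instruct-8K block of the table above.
BGE_M3 = [22.1, 44.3, 50.2, 22.2, 36.7, 48.4, 22.1, 20.1, 12.1, 15.1, 41.4, 40.6, 46.4]
MEMORAG = [22.8, 45.7, 50.7, 28.4, 51.4, 57.0, 27.4, 27.9, 14.1, 16.1, 47.8, 47.9, 55.5]


def score_gains(baseline, system, names=DATASETS):
    """Return {dataset: system score minus baseline score}, rounded to one decimal."""
    return {name: round(s - b, 1) for name, b, s in zip(names, baseline, system)}


gains = score_gains(BGE_M3, MEMORAG)
for name, gain in gains.items():
    print(f"{name:>12}: {gain:+.1f}")
```

The same function applies to any other pair of rows from the table.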
To evaluate MemoRAG, use the following script:
cd examples
bash longbench/eval.sh
We will update other evaluation scripts soon.
UltraDomain Benchmark: this repo.
Other Evaluation Data: this repo.
MemoRAG is licensed under the MIT License.
If you use MemoRAG in your research, please cite our paper:
@misc{qian2024memorag,
    title={MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery},
    author={Hongjin Qian and Peitian Zhang and Zheng Liu and Kelong Mao and Zhicheng Dou},
    year={2024},
    eprint={2409.05591},
    url={https://arxiv.org/abs/2409.05591},
}