
📃 A contracts clause summarization system using LLM and vector database


d1pankarmedhi/legal_summarizer


Legal Summarizer

Pandas Python Streamlit

Summarize contracts and legal documents using information retrieval and context augmentation with large language models.

Fig: High-level design/architecture diagram


🥱 Hard to understand contract agreements

Legal documents often contain complex terminology that is not used in everyday language. These terms can be confusing and may require domain knowledge to understand.

These documents use highly technical and precise language, which can be challenging for non-legal professionals to grasp. They tend to have long sentences or paragraphs, making it hard for the public to comprehend the content and extract key points.

Legal documents incorporate references to statutes, regulations, clauses, and other legal citations, assuming a deep understanding of the legal system. The absence of straightforward language in these documents can make the content feel distant and unrelatable.

These documents are typically written to avoid risk, so their cautious and conservative wording results in complex sentences that aim to cover every conceivable scenario. Misinterpreting them can have serious consequences. This is why most readers avoid handling legal contracts or agreements by themselves.

🦾 Tackling this challenge with LLMs

With the rise of large language models and organizations like OpenAI, LLMs have become more accessible than ever. These language models laid the foundation of what we call "augmentation": reconstructing a piece of context based on its usage and applicability.

These models can produce human-like, and sometimes better, results when asked to rewrite a piece of text according to an instruction. Their limitations lie in how well they handle the input data, which depends on the data the model was trained on and on the size of the model. Models with billions of parameters are now available, both open-source and closed-source, and which one to choose depends on the user and the problem to be solved.

📯 RAG makes it easy

This project uses RAG, or Retrieval-Augmented Generation. The idea is to search the documents and produce relevant answers for the search query.

Retrieval can use BM25, a keyword-based ranking algorithm, or semantic search, which does not rely on keyword matching alone but takes the surrounding context into account when producing embeddings for similarity search.
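To make the keyword-based option concrete, here is a minimal pure-Python sketch of BM25 ranking over a handful of clauses. It is illustrative only and is not taken from this repository: the function names `tokenize`, `bm25_scores`, and `top_clauses` are hypothetical, the tokenizer is a naive whitespace split, and the defaults `k1=1.5`, `b=0.75` are common choices rather than values the project is known to use.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    query_terms = tokenize(query)
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Non-negative IDF variant: ln(1 + (N - n + 0.5) / (n + 0.5))
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1; b controls length normalization.
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def top_clauses(query, documents, k=2):
    """Return the k documents with the highest BM25 score."""
    scores = bm25_scores(query, documents)
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]


clauses = [
    "Either party may terminate this agreement with thirty days written notice",
    "The tenant shall pay rent on the first day of each month",
    "All confidential information must be kept secret by both parties",
]
print(top_clauses("terminate notice", clauses, k=1))
```

A semantic search would replace the term-matching score with a similarity between embedding vectors of the query and each clause, so a query such as "ending the contract" could still match the termination clause even though it shares no keywords with it.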

After the relevant documents are retrieved, they are passed as context to an LLM, which rewrites that context into a more refined and usable answer.

Read more about RAG and how to use LangChain to build a Q&A system in this blog post: Exploring power of RAG and OpenAI's function calling for question answering
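The augmentation step described above can be sketched as a small prompt-building function. This is a hedged illustration, not the project's actual code: `build_summary_prompt` and `summarize` are hypothetical names, the prompt wording is an assumption, and `llm` stands for any callable that takes a prompt string and returns text (for example, a thin wrapper around whichever model client is in use).

```python
def build_summary_prompt(question, retrieved_clauses):
    """Combine the retrieved clauses and the user's question into one prompt."""
    # Number each clause so the model's answer can refer back to its source.
    context = "\n\n".join(
        f"[Clause {i}] {clause}"
        for i, clause in enumerate(retrieved_clauses, start=1)
    )
    return (
        "You are helping a non-lawyer understand a contract.\n"
        "Using only the clauses below, answer the question in plain language.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def summarize(question, retrieved_clauses, llm):
    """Send the augmented prompt to the model; `llm` is a hypothetical callable."""
    return llm(build_summary_prompt(question, retrieved_clauses))


if __name__ == "__main__":
    clauses = ["Either party may terminate this agreement with thirty days written notice."]
    # Stand-in for a real model call, so the sketch runs without network access.
    fake_llm = lambda prompt: f"(model output for a prompt of {len(prompt)} characters)"
    print(summarize("How can I end this contract?", clauses, fake_llm))
```

Keeping the model behind a plain callable makes the retrieval and prompt logic testable without calling a hosted model.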

🧰 Getting started

Create a virtual environment and install the requirements

$ python -m venv venv
$ venv\Scripts\activate # Windows
$ source venv/bin/activate # Linux
$ pip install -r requirements.txt

Start the application

$ streamlit run summarize.py