Update README.md
bukosabino authored Nov 20, 2023
1 parent 5dba87b commit c54dea4
Showing 1 changed file with 15 additions and 12 deletions.
27 changes: 15 additions & 12 deletions README.md
@@ -1,6 +1,6 @@
# IA BOE

Question/Answering Assistant that generates answers from user questions about the Official State Gazette of Spain:
Question/Answering Assistant that generates answers from user questions about the official state gazette of Spain:
Boletín Oficial del Estado (BOE).

[Spanish link](https://www.boe.es)
@@ -15,15 +15,14 @@ query using the embedded question. The retrieved pieces of text are then sent to

![image (4)](https://github.com/bukosabino/ia-boe/assets/4375209/bb2ad4ce-f90a-40bf-a77f-bc1443b9896e)


## Flow

0. All BOE articles are embedded in vectors and stored in a vector database. This process is run at startup and every day.
0. All BOE articles are converted into embeddings and stored in an embedding database. This process is run at startup and every day.
1. The user writes (using natural language) any question related to the BOE as input to the system.
2. The backend service processes the input request (user question), transforms the question into an embedding, and sends the generated embedding as a query to the embedding database.
3. The embedding database returns documents that most closely match the query.
4. The most similar documents returned by the embedding database are added as context to the input request. Then it sends a request with all the information to the LLM API model.
5. The LLM API model returns a natural language answer to the user's question.
4. The most similar documents returned by the embedding database are added to the input query as context. Then a request with all the information is sent to the LLM API model.
5. The LLM API model returns a natural language answer to the user's question.
6. The user receives an AI-generated response output.
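The steps above can be sketched end to end with toy stand-ins. This is only an illustration, not the project's code: `embed` is a word-count vector instead of a sentence transformer, `ToyEmbeddingDB` replaces the real embedding database, and `echo_llm` is a fake LLM that returns the first context line. All of these names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption: stands in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyEmbeddingDB:
    """In-memory stand-in for the embedding database."""

    def __init__(self) -> None:
        self._rows: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        # Step 0: embed the article and store it.
        self._rows.append((embed(text), text))

    def search(self, query: Counter, top_k: int = 2) -> list[str]:
        # Step 3: return the documents that most closely match the query.
        ranked = sorted(self._rows, key=lambda row: cosine(query, row[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]


def answer(question: str, db: ToyEmbeddingDB, llm: Callable[[str], str]) -> str:
    query = embed(question)                      # step 2
    context = db.search(query)                   # step 3
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"  # step 4
    return llm(prompt)                           # steps 5 and 6


db = ToyEmbeddingDB()
db.add("Royal Decree on the minimum wage for 2023")
db.add("Order regulating fishing quotas in the Atlantic")
db.add("Resolution on university scholarships")


def echo_llm(prompt: str) -> str:
    """Fake LLM: returns the first context line, i.e. the best-matching document."""
    return prompt.splitlines()[1]


top_document = answer("What is the minimum wage?", db, echo_llm)
```

In the real service each stand-in would be replaced by the sentence transformer, the embedding database, and the LLM API, but the order of the calls stays the same.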

## Components
@@ -43,9 +42,9 @@ It is the web service, and it is a central component for the whole system, doing

#### Loading data

We download the BOE documents and break them into small chunks of text (e.g. 2000 characters). Each text chunk is transformed into an embedding (e.g. a dense numerical vector with 768 dimensions). Some additional metadata is also stored with the vectors so that we can pre- or post-filter the search results. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/etl_initial.py)
We download the BOE documents and break them into small chunks of text (e.g. 1200 characters). Each text chunk is transformed into an embedding (e.g. a dense numerical vector with 768 dimensions). Some additional metadata is also stored with the vectors so that we can pre- or post-filter the search results. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/boe/load/run.py)
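A minimal sketch of the chunking step, assuming plain fixed-size character windows. The function names, the record layout, and the `doc-1` identifier are hypothetical; the linked ETL code may split on separators or overlap chunks instead. The embedding call is left out on purpose.

```python
def split_into_chunks(text: str, chunk_size: int = 1200) -> list[str]:
    """Cut text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_records(doc_id: str, text: str, metadata: dict, chunk_size: int = 1200) -> list[dict]:
    """Pair every chunk with the document metadata plus its position in the document."""
    return [
        {
            "id": f"{doc_id}-{index}",
            "text": chunk,
            # The metadata travels with each chunk so searches can be filtered on it.
            "metadata": {**metadata, "chunk_index": index},
        }
        for index, chunk in enumerate(split_into_chunks(text, chunk_size))
    ]


# Hypothetical document of 3000 characters published on a given date.
records = build_records("doc-1", "x" * 3000, {"date": "2023-11-20"})
chunk_lengths = [len(record["text"]) for record in records]
```

Each record would then be embedded and written to the embedding database together with its metadata.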

The BOE is updated every day, so we need to run an ETL job every day to retrieve the new documents, transform them into embeddings, link the metadata, and store them in the embedding database. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/etl_daily.py)
The BOE is updated every day, so we need to run an ETL job every day to retrieve the new documents, transform them into embeddings, link the metadata, and store them in the embedding database. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/boe/load/daily.py)
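The daily job could look roughly like the sketch below. It assumes the store accepts writes keyed by a document id, so re-running a day overwrites instead of duplicating, and it catches up on missed days. Both behaviours are assumptions for illustration, as are `fake_fetch`, `fake_embed`, and the dict used as a store.

```python
from datetime import date, timedelta
from typing import Callable, Iterable


def pending_days(last_loaded: date, today: date) -> list[date]:
    """Every day after last_loaded, up to and including today."""
    gap = (today - last_loaded).days
    return [last_loaded + timedelta(days=n) for n in range(1, gap + 1)]


def run_daily_etl(
    store: dict,
    fetch: Callable[[date], Iterable[tuple[str, str]]],
    embed: Callable[[str], list[float]],
    last_loaded: date,
    today: date,
) -> date:
    """Retrieve, embed, and store the documents of each pending day."""
    for day in pending_days(last_loaded, today):
        for doc_id, text in fetch(day):
            # Writing by id makes a repeated run overwrite rather than duplicate.
            store[doc_id] = {
                "vector": embed(text),
                "text": text,
                "metadata": {"date": day.isoformat()},
            }
    return max(last_loaded, today)


def fake_fetch(day: date) -> list[tuple[str, str]]:
    """Stand-in for downloading the documents published on one day."""
    return [(f"doc-{day.isoformat()}", f"Text published on {day.isoformat()}")]


def fake_embed(text: str) -> list[float]:
    """Stand-in for the sentence transformer."""
    return [float(len(text))]


store: dict = {}
checkpoint = run_daily_etl(store, fake_fetch, fake_embed, date(2023, 11, 17), date(2023, 11, 20))
# Running the same window again leaves the store unchanged in size.
run_daily_etl(store, fake_fetch, fake_embed, date(2023, 11, 17), date(2023, 11, 20))
```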

#### Reading data

@@ -59,18 +58,21 @@ There are different types of ANNs (cosine similarity, Euclidean distance, or dot
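Cosine similarity, Euclidean distance, and dot product are the measures a nearest-neighbour search uses to compare two embeddings. A small sketch in plain Python, with hypothetical function names and example vectors, shows how they differ:

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; grows with both direction match and vector length."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the normalised vectors; ignores vector length."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


same_direction = ([1.0, 0.0], [2.0, 0.0])
orthogonal = ([1.0, 0.0], [0.0, 3.0])

cos_same = cosine_similarity(*same_direction)
dist_same = euclidean_distance(*same_direction)
dot_same = dot_product(*same_direction)
cos_orthogonal = cosine_similarity(*orthogonal)
```

The two vectors in `same_direction` point the same way but have different lengths, so cosine similarity treats them as identical while Euclidean distance does not.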

The text in BOE is written in Spanish, so we need a sentence transformer model that is fine-tuned using Spanish
datasets. We are experimenting with [these models](https://github.com/bukosabino/sbert-spanish).

More info: https://www.newsletter.swirlai.com/p/sai-notes-07-what-is-a-vector-database

### LLM API Model

It is a Large Language Model (LLM) that generates answers to user questions based on the context, that is,
the most similar documents returned by the embedding database.
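As a rough illustration of how the retrieved documents can be passed as context, the sketch below assembles a prompt under a character budget. The prompt wording, the `build_prompt` name, and the `max_context_chars` parameter are assumptions, not the project's actual prompt.

```python
def build_prompt(question: str, documents: list[str], max_context_chars: int = 6000) -> str:
    """Build an LLM prompt from the question and the most similar documents.

    Documents are expected most-similar first; those that no longer fit
    in the character budget are dropped.
    """
    selected: list[str] = []
    used = 0
    for document in documents:
        if used + len(document) > max_context_chars:
            break
        selected.append(document)
        used += len(document)

    context = "\n\n".join(f"[{i}] {doc}" for i, doc in enumerate(selected, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is not enough, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Three hypothetical 10-character documents with a 25-character budget:
# only the first two fit.
prompt = build_prompt("What changed?", ["a" * 10, "b" * 10, "c" * 10], max_context_chars=25)
```

The resulting string would be sent to whichever LLM API is configured.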

Options:
* OpenAI (Third party API)
* Falcon (Own API)
* Llama2 (Own API)
## Tools

- LangChain
- FastAPI
- Qdrant
- BeautifulSoup
- gpt-4-1106-preview (OpenAI)

# Deploy your own service

@@ -84,6 +86,7 @@ Check `deployment_guide.md` file
* Generate smart questions from an article
* Use the OpenAI Moderation API to filter inappropriate user input: https://platform.openai.com/docs/guides/moderation
* Create an OpenAI plugin
* Create a Justicio GPT on OpenAI

# Want to help develop the project?
