Merge pull request #39 from bukosabino/bukosabino-patch-2
Update README.md
bukosabino authored Nov 20, 2023
2 parents 5dba87b + c54dea4 commit 1576217
Showing 1 changed file with 15 additions and 12 deletions.
@@ -1,6 +1,6 @@
# IA BOE

-Question/Answering Assistant that generates answers from user questions about the Official State Gazette of Spain:
+Question/Answering Assistant that generates answers from user questions about the official state gazette of Spain:
Boletín Oficial del Estado (BOE).

[Spanish link](https://www.boe.es)
@@ -15,15 +15,14 @@ query using the embedded question. The retrieved pieces of text are then sent to

![image (4)](https://github.com/bukosabino/ia-boe/assets/4375209/bb2ad4ce-f90a-40bf-a77f-bc1443b9896e)


## Flow

-0. All BOE articles are embedded in vectors and stored in a vector database. This process is run at startup and every day.
+0. All BOE articles are transformed into embeddings and stored in an embedding database. This process is run at startup and every day.
1. The user writes (using natural language) any question related to the BOE as input to the system.
2. The backend service processes the input request (user question), transforms the question into an embedding, and sends the generated embedding as a query to the embedding database.
3. The embedding database returns documents that most closely match the query.
-4. The most similar documents returned by the embedding database are added as context to the input request. Then it sends a request with all the information to the LLM API model.
-5. The LLM API model returns a natural language answer to the user's question.
+4. The most similar documents returned by the embedding database are added to the input query as context. Then a request with all the information is sent to the LLM API model.
+5. The LLM API model returns a natural language answer to the user's question.
6. The user receives the AI-generated response.
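
Taken together, these steps are a standard retrieval-augmented generation loop. Below is a minimal sketch of that loop; the Qdrant collection name, the `text` payload field, and the embedding model are assumptions for illustration, not names taken from the project code.

```python
# Minimal sketch of steps 1-6, assuming a Qdrant collection already populated
# by the ETL, a multilingual sentence-transformer, and the OpenAI chat API.
from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # illustrative model
qdrant = QdrantClient(url="http://localhost:6333")
llm = OpenAI()


def answer(question: str) -> str:
    # 2. Embed the user question.
    query_vector = encoder.encode(question).tolist()
    # 3. Ask the embedding database for the closest BOE chunks.
    hits = qdrant.search(collection_name="boe", query_vector=query_vector, limit=5)
    # 4. Add the retrieved chunks to the query as context.
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    prompt = f"Responde usando solo este contexto del BOE:\n{context}\n\nPregunta: {question}"
    # 5. Send everything to the LLM API model.
    completion = llm.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    # 6. Return the generated answer to the user.
    return completion.choices[0].message.content
```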

## Components
@@ -43,9 +42,9 @@ It is the web service, and it is a central component for the whole system, doing

#### Loading data

-We download the BOE documents and break them into small chunks of text (e.g. 2000 characters). Each text chunk is transformed into an embedding (e.g. a dense numerical vector of size 768). Some additional metadata is also stored with the vectors so that we can pre- or post-filter the search results. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/etl_initial.py)
+We download the BOE documents and break them into small chunks of text (e.g. 1200 characters). Each text chunk is transformed into an embedding (e.g. a dense numerical vector of size 768). Some additional metadata is also stored with the vectors so that we can pre- or post-filter the search results. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/boe/load/run.py)

-The BOE is updated every day, so we need to run a daily ETL job to retrieve the new documents, transform them into embeddings, link the metadata, and store them in the embedding database. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/etl_daily.py)
+The BOE is updated every day, so we need to run a daily ETL job to retrieve the new documents, transform them into embeddings, link the metadata, and store them in the embedding database. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/boe/load/daily.py)
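
A minimal sketch of this loading step, assuming `sentence-transformers` and `qdrant-client`; the 1200-character chunks and 768-dimensional vectors come from this README, while the model, collection name, and metadata fields are illustrative.

```python
# Chunk a document, embed each chunk, and store the vectors with metadata.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 1200  # characters per chunk, as described above


def chunk(text: str, size: int = CHUNK_SIZE) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # illustrative model
client = QdrantClient(":memory:")  # point this at the real cluster in production
client.recreate_collection(
    collection_name="boe",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

document = "Texto completo de una disposición publicada en el BOE..."
pieces = chunk(document)
vectors = encoder.encode(pieces)
client.upsert(
    collection_name="boe",
    points=[
        PointStruct(
            id=i,
            vector=vector.tolist(),
            # Metadata stored alongside the vector for pre-/post-filtering.
            payload={"text": piece, "publication_date": "2023-11-20", "source": "BOE"},
        )
        for i, (piece, vector) in enumerate(zip(pieces, vectors))
    ],
)
```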

#### Reading data

@@ -59,18 +58,21 @@ There are different types of ANNs (cosine similarity, Euclidean distance, or dot

The text in the BOE is written in Spanish, so we need a sentence transformer model that is fine-tuned on Spanish
datasets. We are experimenting with [these models](https://github.com/bukosabino/sbert-spanish).

More info: https://www.newsletter.swirlai.com/p/sai-notes-07-what-is-a-vector-database
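
As a small illustration of the similarity search idea, the sketch below embeds a Spanish question and two candidate chunks and ranks them by cosine similarity; the multilingual model is a stand-in, not necessarily the Spanish-specific model the project will settle on.

```python
# Rank text chunks against a Spanish query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # stand-in model
chunks = [
    "Orden por la que se regulan las ayudas al alquiler de vivienda.",
    "Resolución sobre la convocatoria de oposiciones al cuerpo de letrados.",
]
query = "¿Qué ayudas existen para el alquiler de vivienda?"

query_vec = model.encode(query, convert_to_tensor=True)
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]  # one cosine score per chunk
best = int(scores.argmax())
print(chunks[best], float(scores[best]))
```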

### LLM API Model

It is a Large Language Model (LLM) that generates answers to the user's questions based on the context, that is,
the most similar documents returned by the embedding database.
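
A sketch of how the retrieved documents might be packed into the request sent to the LLM; the Spanish prompt wording is illustrative, not the project's actual prompt.

```python
# Build chat messages that constrain the model to the retrieved BOE context.
def build_messages(question: str, documents: list[str]) -> list[dict]:
    context = "\n\n".join(documents)
    system_prompt = (
        "Eres un asistente jurídico. Responde a la pregunta usando únicamente "
        "el siguiente contexto extraído del BOE. Si el contexto no contiene la "
        "respuesta, dilo claramente.\n\n" + context
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    "¿Quién puede solicitar la nacionalidad española por residencia?",
    ["Artículo 22 del Código Civil: ..."],
)
print(messages)
```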

-Options:
-* OpenAI (Third party API)
-* Falcon (Own API)
-* Llama2 (Own API)
+## Tools
+
+- Langchain
+- FastAPI
+- Qdrant
+- BeautifulSoup
+- gpt-4-1106-preview (OpenAI)
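
As a rough illustration of how these tools fit together, a minimal FastAPI endpoint could wrap the question-answering flow described above; the route and the `answer()` placeholder are illustrative, not the project's actual API.

```python
# Expose the question-answering flow as a small web service.
from fastapi import FastAPI

app = FastAPI(title="IA BOE")


def answer(question: str) -> str:
    # Placeholder for the retrieve-then-generate pipeline described in "Flow".
    return f"Respuesta generada para: {question}"


@app.get("/qa")
def qa(question: str) -> dict:
    return {"question": question, "answer": answer(question)}

# Run locally with: uvicorn main:app --reload
```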

# Deploy your own service

@@ -84,6 +86,7 @@ Check `deployment_guide.md` file
* Generate smart questions from an article
* Use the OpenAI Moderation API to filter inappropriate user behaviour: https://platform.openai.com/docs/guides/moderation
* Create an OpenAI plugin
+* Create a Justicio GPT on OpenAI

# Want to help develop the project?
