Merge pull request #39 from bukosabino/bukosabino-patch-2
Update README.md
bukosabino authored Nov 20, 2023
2 parents 5dba87b + c54dea4 commit 1576217
Showing 1 changed file with 15 additions and 12 deletions.
@@ -1,6 +1,6 @@
# IA BOE

-Question/Answering Assistant that generates answers from user questions about the Official State Gazette of Spain:
+Question/Answering Assistant that generates answers from user questions about the official state gazette of Spain:
Boletín Oficial del Estado (BOE).

[Spanish link](https://www.boe.es)
@@ -15,15 +15,14 @@ query using the embedded question. The retrieved pieces of text are then sent to

![image (4)](https://github.com/bukosabino/ia-boe/assets/4375209/bb2ad4ce-f90a-40bf-a77f-bc1443b9896e)


## Flow

-0. All BOE articles are embedded in vectors and stored in a vector database. This process is run at startup and every day.
+0. All BOE articles are transformed into embeddings and stored in an embedding database. This process is run at startup and every day.
1. The user writes (using natural language) any question related to the BOE as input to the system.
2. The backend service processes the input request (user question), transforms the question into an embedding, and sends the generated embedding as a query to the embedding database.
3. The embedding database returns documents that most closely match the query.
-4. The most similar documents returned by the embedding database are added as context to the input request. Then it sends a request with all the information to the LLM API model.
-5. The LLM API model returns a natural language answer to the user's question.
+4. The most similar documents returned by the embedding database are added to the input query as context. Then a request with all the information is sent to the LLM API model.
+5. The LLM API model returns a natural language answer to the user's question.
6. The user receives the AI-generated response.
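
Taken together, these steps are a standard retrieval-augmented generation loop. Below is a minimal sketch of that loop; the Qdrant collection name, the `text` payload field, and the embedding model are assumptions for illustration, not names taken from the project code.

```python
# Minimal sketch of steps 1-6, assuming a Qdrant collection already populated
# by the ETL, a multilingual sentence-transformer, and the OpenAI chat API.
from openai import OpenAI
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # illustrative model
qdrant = QdrantClient(url="http://localhost:6333")
llm = OpenAI()


def answer(question: str) -> str:
    # 2. Embed the user question.
    query_vector = encoder.encode(question).tolist()
    # 3. Ask the embedding database for the closest BOE chunks.
    hits = qdrant.search(collection_name="boe", query_vector=query_vector, limit=5)
    # 4. Add the retrieved chunks to the query as context.
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    prompt = f"Responde usando solo este contexto del BOE:\n{context}\n\nPregunta: {question}"
    # 5. Send everything to the LLM API model.
    completion = llm.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    # 6. Return the generated answer to the user.
    return completion.choices[0].message.content
```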

## Components
@@ -43,9 +42,9 @@ It is the web service, and it is a central component for the whole system, doing

#### Loading data

-We download the BOE documents and break them into small chunks of text (e.g. 2000 characters). Each text chunk is transformed into an embedding (e.g. a dense numerical vector of size 768). Some additional metadata is also stored with the vectors so that we can pre- or post-filter the search results. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/etl_initial.py)
+We download the BOE documents and break them into small chunks of text (e.g. 1200 characters). Each text chunk is transformed into an embedding (e.g. a dense numerical vector of size 768). Some additional metadata is also stored with the vectors so that we can pre- or post-filter the search results. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/boe/load/run.py)

-The BOE is updated every day, so we need to run a daily ETL job to retrieve the new documents, transform them into embeddings, link the metadata, and store them in the embedding database. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/etl_daily.py)
+The BOE is updated every day, so we need to run a daily ETL job to retrieve the new documents, transform them into embeddings, link the metadata, and store them in the embedding database. [Check the code](https://github.com/bukosabino/ia-boe/blob/main/src/etls/boe/load/daily.py)
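
A minimal sketch of this loading step, assuming `sentence-transformers` and `qdrant-client`; the 1200-character chunks and 768-dimensional vectors come from this README, while the model, collection name, and metadata fields are illustrative.

```python
# Chunk a document, embed each chunk, and store the vectors with metadata.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 1200  # characters per chunk, as described above


def chunk(text: str, size: int = CHUNK_SIZE) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # illustrative model
client = QdrantClient(":memory:")  # point this at the real cluster in production
client.recreate_collection(
    collection_name="boe",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

document = "Texto completo de una disposición publicada en el BOE..."
pieces = chunk(document)
vectors = encoder.encode(pieces)
client.upsert(
    collection_name="boe",
    points=[
        PointStruct(
            id=i,
            vector=vector.tolist(),
            # Metadata stored alongside the vector for pre-/post-filtering.
            payload={"text": piece, "publication_date": "2023-11-20", "source": "BOE"},
        )
        for i, (piece, vector) in enumerate(zip(pieces, vectors))
    ],
)
```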

#### Reading data

@@ -59,18 +58,21 @@ There are different types of ANNs (cosine similarity, Euclidean distance, or dot

The text in the BOE is written in Spanish, so we need a sentence transformer model that is fine-tuned on Spanish
datasets. We are experimenting with [these models](https://github.com/bukosabino/sbert-spanish).

More info: https://www.newsletter.swirlai.com/p/sai-notes-07-what-is-a-vector-database
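
As a small illustration of the similarity search idea, the sketch below embeds a Spanish question and two candidate chunks and ranks them by cosine similarity; the multilingual model is a stand-in, not necessarily the Spanish-specific model the project will settle on.

```python
# Rank text chunks against a Spanish query by cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # stand-in model
chunks = [
    "Orden por la que se regulan las ayudas al alquiler de vivienda.",
    "Resolución sobre la convocatoria de oposiciones al cuerpo de letrados.",
]
query = "¿Qué ayudas existen para el alquiler de vivienda?"

query_vec = model.encode(query, convert_to_tensor=True)
chunk_vecs = model.encode(chunks, convert_to_tensor=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]  # one cosine score per chunk
best = int(scores.argmax())
print(chunks[best], float(scores[best]))
```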

### LLM API Model

It is a Large Language Model (LLM) that generates answers to the user's questions based on the context, that is,
the most similar documents returned by the embedding database.
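
A sketch of how the retrieved documents might be packed into the request sent to the LLM; the Spanish prompt wording is illustrative, not the project's actual prompt.

```python
# Build chat messages that constrain the model to the retrieved BOE context.
def build_messages(question: str, documents: list[str]) -> list[dict]:
    context = "\n\n".join(documents)
    system_prompt = (
        "Eres un asistente jurídico. Responde a la pregunta usando únicamente "
        "el siguiente contexto extraído del BOE. Si el contexto no contiene la "
        "respuesta, dilo claramente.\n\n" + context
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


messages = build_messages(
    "¿Quién puede solicitar la nacionalidad española por residencia?",
    ["Artículo 22 del Código Civil: ..."],
)
print(messages)
```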

-Options:
-* OpenAI (Third party API)
-* Falcon (Own API)
-* Llama2 (Own API)
+## Tools
+
+- Langchain
+- FastAPI
+- Qdrant
+- BeautifulSoup
+- gpt-4-1106-preview (OpenAI)
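
As a rough illustration of how these tools fit together, a minimal FastAPI endpoint could wrap the question-answering flow described above; the route and the `answer()` placeholder are illustrative, not the project's actual API.

```python
# Expose the question-answering flow as a small web service.
from fastapi import FastAPI

app = FastAPI(title="IA BOE")


def answer(question: str) -> str:
    # Placeholder for the retrieve-then-generate pipeline described in "Flow".
    return f"Respuesta generada para: {question}"


@app.get("/qa")
def qa(question: str) -> dict:
    return {"question": question, "answer": answer(question)}

# Run locally with: uvicorn main:app --reload
```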

# Deploy your own service

@@ -84,6 +86,7 @@ Check `deployment_guide.md` file
* Generate smart questions from an article
* Use the OpenAI Moderation API to filter inappropriate user behaviour: https://platform.openai.com/docs/guides/moderation
* Create an OpenAI plugin
+* Create a Justicio GPT on OpenAI

# Want to help develop the project?
