This project is a collection of notebooks and a simple Flask web server that serves Gemma-2 using llama-cpp.
The goal is to fine-tune a model so that it performs better at extracting data into a structured format (JSON).
You need to provide the output schema in OpenAPI format and the text (context) to extract the data from.
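As a rough illustration of these two inputs, the sketch below builds an OpenAPI-style schema object and a short context string. The field names (`name`, `age`, `city`), the sample sentence, and the `build_inputs` helper with its dictionary keys are all hypothetical; the prompt format the model actually expects is the one shown in the project's example script, not this one.

```python
import json

# Hypothetical schema: the field names are illustrative, not taken from the project.
# An OpenAPI schema object describes the shape of the JSON the model should return.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "city": {"type": "string"},
    },
    "required": ["name"],
}

# The free text (context) from which the fields should be extracted.
context = "Alice is 31 years old and lives in Lyon."


def build_inputs(schema: dict, context: str) -> dict:
    """Bundle the two inputs; the key names here are assumptions."""
    return {"schema": json.dumps(schema), "context": context}


inputs = build_inputs(schema, context)
print(inputs["schema"])
```

Serializing the schema with `json.dumps` keeps it as plain text, which is convenient when it has to be embedded in a prompt.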
The project is split between notebooks (fine-tuning, quantization, and evaluation) and Python files.
Source | Description |
---|---|
➡️ Gemma-2 Finetuning | A notebook that shows how to fine-tune and quantize gemma2-2b-it using the unsloth and Hugging Face libraries. |
➡️ Server | A simple Flask REST server using llama-cpp with a 4-bit quantized model. |
➡️ CI/CD | A GitHub Actions workflow with a formatting/linting step (ruff), testing (pytest), and a Docker image build. |
➡️ Dockerfile | A multistage Dockerfile to build the server with gunicorn. |
The fine-tuned models are available in safetensors and GGUF format (4-bit, 8-bit) on the Hugging Face Hub at bastienp/Gemma-2-2B-it-JSON-data-extration.
Note: The model page also gives more details on how to use the models with llama-cpp or unsloth.
Recommended: Use uv, the fast Python package installer and resolver from Astral.
Alternatively, you can use pip instead of uv. The uv documentation describes how to install it.
- Sync the dependencies with uv
uv venv .venv
source .venv/bin/activate
uv sync --all-extras --dev # in addition it adds pytest and ruff
- Launch a Flask dev server
flask --app src.web.app run --debug
To reproduce the fine-tuning, the easiest way is to use Google Colab (the free version is sufficient).
- Run the tests (API testing)
pytest
Note: An example of how to call the API and the prompt format can be found in examples/example_api_call.py.
The easiest way to deploy the model is to use the provided Docker image.
- Pull the image from GitHub (built by the CI):
docker pull ghcr.io/bastienpo/unsloth_finetuning:main
Note: Alternatively, you can build the image yourself:
docker build -t unsloth_finetuning:0.0.1 .
- Run the Docker image
docker run -p 8000:8000 -d ghcr.io/bastienpo/unsloth_finetuning:main # or unsloth_finetuning:0.0.1 if you built it locally
- Make a POST request
curl -i -H "Content-Type: application/json" -X POST -d '{"query": "How are you ?"}' http://localhost:8000/api/v1/chat/completions
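
The same request can be sketched in Python with only the standard library. The URL and the `query` payload key come from the curl command above; the helper names `build_request` and `send` and the 60-second timeout are illustrative choices. The response is returned as raw text because its format is not described here.

```python
import json
import urllib.request

# Endpoint taken from the curl example; assumes the container is listening on port 8000.
API_URL = "http://localhost:8000/api/v1/chat/completions"


def build_request(query: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(query: str, timeout: float = 60.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Requires the server to be running:
# print(send("How are you ?"))
```

Keeping `build_request` separate from `send` makes it possible to inspect the payload and headers without a running server.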