Add GenAI example for using OCI images for model storage (#447)
* Add GenAI example for using OCI images for model storage

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Fix review comments: @guimou

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe

Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
Signed-off-by: Edgar Hernández <ehernand@redhat.com>

* Fix review comments: Filippe and DanieleZ.

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

* Feedback: Filippe

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>

---------

Signed-off-by: Edgar Hernández <23639005+israel-hdez@users.noreply.github.com>
Signed-off-by: Edgar Hernández <ehernand@redhat.com>
Co-authored-by: Filippe Spolti <filippespolti@gmail.com>
israel-hdez and spolti authored Dec 20, 2024
1 parent 4b2f139 commit 9622f4b
286 changes: 238 additions & 48 deletions docs/odh/oci-model-storage.md
@@ -8,15 +8,233 @@ which also explains how to deploy models from OCI images.

This page offers a guide similar to the upstream project documentation, but
focusing on the OpenDataHub and OpenShift characteristics. To demonstrate
how to create and use OCI containers, two examples are provided:
* The first example uses [IBM's Granite-3.0-2B-Instruct model](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct)
available in Hugging Face. This is a generative AI model.
* The second example uses the [MobileNet v2-7 model](https://github.com/onnx/models/tree/main/validated/vision/classification/mobilenet).
This is a predictive AI model in ONNX format.

## Creating and deploying an OCI image of IBM's Granite-3.0-2B-Instruct model

IBM's Granite-3.0-2B-Instruct model is [available at Hugging Face](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct).
To create an OCI container image, the model needs to be downloaded and copied into
the container. Once the OCI image is built and published in a registry, it can be
deployed on the cluster.

The ODH project provides configurations for the vLLM model server, which
supports running the Granite model. Thus, this guide will use this model server
to demonstrate how to deploy the Granite model stored in an OCI image.

### Storing the Granite model in an OCI image

To download the Granite model, the [`huggingface-cli download` command](https://huggingface.co/docs/huggingface_hub/guides/cli#huggingface-cli-download) is
used during the OCI container build process, so that downloading the model and building the
final container image happen in a single workflow. The process is as follows:
* Install the Hugging Face CLI
* Use the Hugging Face CLI to download the model
* Create the final OCI image using the downloaded model

This process is implemented in the following multi-stage container build. Create a file named
`Containerfile` with the following contents:
```Dockerfile
##### Stage 1: Download the model
FROM registry.access.redhat.com/ubi9/python-312:latest as downloader

# Install huggingface-cli
RUN pip install "huggingface_hub[cli]"

# Download the model
ARG repo_id
ARG token
RUN mkdir models && huggingface-cli download --quiet --max-workers 2 --token "${token}" --local-dir ./models $repo_id

##### Stage 2: Build the final OCI model container
FROM registry.access.redhat.com/ubi9/ubi-micro:latest as model

# Copy from the download stage
COPY --from=downloader --chown=1001:0 /opt/app-root/src/models /models

# Set proper privileges for KServe
RUN chmod -R a=rX /models

# Use non-root user as default
USER 1001
```

> [!TIP]
> This Containerfile should be generic enough to download and containerize any
> model from Hugging Face. However, it has only been tested with the Granite model.
> Feel free to try it with any other model that can work with the vLLM server.

Notice that model files are copied into `/models` inside the final container. KServe
expects this path to exist in the OCI image and also expects the model files to
be inside it.

Also, notice that `ubi9-micro` is used as the base image of the final container.
Empty images, like `scratch`, cannot be used, because KServe needs to configure the model image
with a command to keep it alive and ensure the model files remain available in
the pod. Thus, it is required to use a base image that provides a shell.

Finally, notice that ownership of the copied model files is changed to the `root`
group, and read permissions are granted. This is important, because OpenShift
runs containers with a random user ID and with the `root` group ID. The adjustment
of the group and of the privileges on the model files ensures that the model server
can access them.
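
Once the image is built (next step), you can optionally verify these permissions
locally by running the image with an arbitrary user ID in the `root` group, the way
OpenShift will; the image reference below is a placeholder:
```shell
# Run as an arbitrary non-root UID in the root group (as OpenShift does)
# and confirm the model files are listable and readable.
podman run --rm --user 1000960000:0 \
  quay.io/<user_name>/<repository_name>:<tag_name> ls -lR /models
```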

Create the OCI container image of the Granite model using Podman, and upload it to
a registry. For example, using Quay as the registry:
```shell
podman build --format=oci --squash \
--build-arg repo_id=ibm-granite/granite-3.0-2b-instruct \
-t quay.io/<user_name>/<repository_name>:<tag_name> .

podman push quay.io/<user_name>/<repository_name>:<tag_name>
```

It is important to use the `--squash` flag to prevent the final image from being
roughly double the size of the model: without it, the `RUN chmod` instruction would
create an additional layer containing a second copy of the model files.
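
To confirm that the squash worked, you can check the image size and its layers;
the image reference is the placeholder used above:
```shell
# The image size should be close to the model size plus the ubi-micro base,
# and the model files should appear in a single layer.
podman images quay.io/<user_name>/<repository_name>:<tag_name>
podman history quay.io/<user_name>/<repository_name>:<tag_name>
```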

> [!TIP]
> If you have access to gated repositories, you can provide the optional argument
> `--build-arg token="{your_access_token}"` to containerize models from your
> accessible gated repositories.

> [!TIP]
> When uploading your container image, if your repository is private, ensure you
> are authenticated to the registry.

### Deploying the Granite model using the generated OCI image

Start by creating a namespace to deploy the model:
```shell
oc new-project oci-model-example
```

In the newly created namespace, you need to create a `ServingRuntime` resource
configuring the vLLM model server. The ODH project provides templates with
configurations for some model servers. The template that is applicable for
KServe and holds the vLLM configuration is the one named `vllm-runtime-template`:
```shell
oc get templates -n opendatahub vllm-runtime-template

NAME DESCRIPTION PARAMETERS OBJECTS
vllm-runtime-template vLLM is a high-throughput and memory-efficient inference and serving engine f... 0 (all set) 1
```

To create an instance of it, run the following command:
```shell
oc process -n opendatahub -o yaml vllm-runtime-template | oc apply -f -
```

You can verify that the `ServingRuntime` has been created successfully with the
following command:
```shell
oc get servingruntimes

NAME DISABLED MODELTYPE CONTAINERS AGE
vllm-runtime vLLM kserve-container 11s
```

Notice that the `ServingRuntime` has been created with the name `vllm-runtime`.

Now that the `ServingRuntime` is configured, the Granite model stored in an OCI image can
be deployed by creating an `InferenceService` resource:
```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sample-isvc-using-oci
spec:
  predictor:
    model:
      runtime: vllm-runtime # This is the name of the ServingRuntime resource
      modelFormat:
        name: vLLM
      storageUri: oci://quay.io/<user_name>/<repository_name>:<tag_name>
      args:
      - --dtype=half
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
```
> [!IMPORTANT]
> The resulting `ServingRuntime` and `InferenceService` configurations won't set
> any CPU and memory limits.
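
If you want to bound CPU and memory as well, one option is to add requests and
limits to the predictor's `resources` block in the manifest above, or to patch the
resource after it is created. The following is only a sketch with arbitrary values,
not part of the provided configurations; adjust it to your hardware and model size:
```shell
# Example only: merge CPU/memory requests and limits into the existing resources.
oc patch inferenceservice sample-isvc-using-oci --type merge -p \
  '{"spec":{"predictor":{"model":{"resources":{
     "requests":{"cpu":"2","memory":"16Gi","nvidia.com/gpu":1},
     "limits":{"cpu":"4","memory":"24Gi","nvidia.com/gpu":1}}}}}}'
```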

> [!NOTE]
> The additional `--dtype=half` argument is not required if your GPU has a compute
> capability of 8.0 or greater.
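
To create the resource, save the manifest above to a file (the name
`inferenceservice.yaml` is only an example) and apply it in the project created earlier:
```shell
oc apply -n oci-model-example -f inferenceservice.yaml
```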

Once the `InferenceService` resource is created, KServe will deploy the model
stored in the OCI image referenced by the `storageUri` field. Check the status
of the deployment with the following command:
```shell
oc get inferenceservice
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
sample-isvc-using-oci https://sample-isvc-using-oci-oci-model-example.example True 100 sample-isvc-using-oci-predictor-00001 2m11s
```
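
The URL in your cluster will differ from the sample output above. You can retrieve
it from the `InferenceService` status, for example:
```shell
# Prints the external URL assigned to the InferenceService.
oc get inferenceservice sample-isvc-using-oci -o jsonpath='{.status.url}{"\n"}'
```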

> [!IMPORTANT]
> Remember that, by default, models are exposed outside the cluster and not
> protected with authorization. Read the [authorization guide](authorization.md#deploying-a-protected-inferenceservice)
> and the [private services guide (TODO)](#TODO) to learn how to privately deploy
> models and how to protect them with authorization.

Test that the model is working:
```sh
# Query
curl https://sample-isvc-using-oci-oci-model-example.apps.rosa.ehernand-test.v16g.p3.openshiftapps.com/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sample-isvc-using-oci",
    "prompt": "What is the IBM granite-3 model?",
    "max_tokens": 200,
    "temperature": 0.8
  }' | jq

# Response:
{
  "id": "cmpl-639e7e1911e942eeb34bc9db9ff9c9fc",
  "object": "text_completion",
  "created": 1733527176,
  "model": "sample-isvc-using-oci",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe IBM Granite-3 is a high-performance computing system designed for complex data analytics, artificial intelligence, and machine learning applications. It is built on IBM's Power10 processor-based architecture and features:\n\n1. **Power10 Processor-based Architecture**: The IBM Granite-3 is powered by IBM's latest Power10 processor, which provides high performance, energy efficiency, and advanced security features.\n\n2. **High-Performance Memory**: The system features high-speed memory, enabling fast data access and processing, which is crucial for complex data analytics and AI applications.\n\n3. **Large Memory Capacity**: The IBM Granite-3 supports large memory capacities, allowing for the processing of vast amounts of data.\n\n4. **High-Speed Interconnect**: The system features a high-speed interconnect, enabling quick data transfer between system components.\n\n5. **Advanced Security Features",
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "prompt_logprobs": null
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 210,
    "completion_tokens": 200
  }
}
```
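
Since Granite-3.0-2B-Instruct is an instruction-tuned model and vLLM exposes an
OpenAI-compatible API, you can also try the chat completions endpoint. This is a
sketch; replace the host with the URL of your own `InferenceService`:
```sh
curl https://<your-inference-service-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sample-isvc-using-oci",
    "messages": [{"role": "user", "content": "Summarize what an OCI image is in one sentence."}],
    "max_tokens": 100
  }' | jq
```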

## Creating and deploying an OCI image of MobileNet v2-7 model

The MobileNet v2-7 model is available at the [onnx/models](https://github.com/onnx/models/tree/main/validated/vision/classification/mobilenet)
GitHub repository. This model is in ONNX format; in this example, you will
download it and package it into an OCI image.

The ODH project provides configurations for the OpenVINO model server, which
supports models in ONNX format. Thus, this guide will use this model server
to demonstrate how to deploy the MobileNet v2-7 model stored in an OCI image.

### Storing the MobileNet v2-7 model in an OCI image

Start by creating an empty directory for downloading the model and creating
the necessary support files to create the OCI image. You may use a temporary
@@ -48,28 +266,21 @@ curl -L $DOWNLOAD_URL -O --output-dir models/1/

Create a file named `Containerfile` with the following contents:
```Dockerfile
FROM registry.access.redhat.com/ubi9/ubi-micro:latest

# Copy the downloaded model
COPY --chown=1001:0 models /models

# Set proper privileges for KServe
RUN chmod -R a=rX /models

# Use non-root user as default
USER 1001
```

Similarly to the Granite example, notice that model files are copied into `/models`,
that the ownership of the copied model files is changed to the `root` group with
read permissions granted, and that empty base images like `scratch` cannot be used.

Verify that the directory structure is good using the `tree` command:
```shell
@@ -88,40 +299,19 @@ tree
Create the OCI container image with Podman, and upload it to a registry. For
example, using Quay as the registry:
```shell
podman build --format=oci --squash -t quay.io/<user_name>/<repository_name>:<tag_name> .
podman push quay.io/<user_name>/<repository_name>:<tag_name>
```

> [!TIP]
> When uploading your container image, if your repository is private, ensure you
> are authenticated to the registry.

### Deploying the MobileNet v2-7 model using the generated OCI image

As mentioned earlier, the OpenVINO model server is used to deploy the MobileNet model.
Create the OpenVINO `ServingRuntime` from the provided template:
```shell
oc process -n opendatahub -o yaml kserve-ovms | oc apply -f -
```
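
As with the vLLM example, you can verify that the runtime was created. The resulting
resource is expected to be named `kserve-ovms`, matching the template, although the
exact name comes from the template contents:
```shell
oc get servingruntimes
```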
