diff --git a/best-practices/ml-platform/.gitignore b/best-practices/ml-platform/.gitignore
index 8e6c23947..eb41f5daa 100644
--- a/best-practices/ml-platform/.gitignore
+++ b/best-practices/ml-platform/.gitignore
@@ -1,2 +1,7 @@
+.venv/
+mlruns/
+venv/
+
+.python-version
test/log/*.log
test/scripts/locks/*.lock
diff --git a/best-practices/ml-platform/README.md b/best-practices/ml-platform/README.md
index 09e7ff061..b5cd0a70e 100644
--- a/best-practices/ml-platform/README.md
+++ b/best-practices/ml-platform/README.md
@@ -60,7 +60,9 @@ For an outline of products and features used in the platform, see the [Platform
## Use cases
-- [Distributed Data Processing with Ray](examples/use-case/ray/dataprocessing/README.md): Run a distributed data processing job using Ray.
+- [Distributed Data Processing with Ray](examples/use-case/data-processing/ray/README.md): Run a distributed data processing job using Ray.
+- [Dataset Preparation for Fine-Tuning Gemma IT With Gemini Flash](examples/use-case/data-preparation/gemma-it/README.md): Generate prompts for fine-tuning the Gemma Instruction Tuned model with Vertex AI Gemini Flash
+- [Fine-Tuning Gemma2 9B IT Model With FSDP](examples/use-case/fine-tuning/pytorch/README.md): Fine-tune the Gemma2 9B IT model with PyTorch FSDP
## Resources
diff --git a/best-practices/ml-platform/docs/images/create-log-based-metrics.png b/best-practices/ml-platform/docs/images/create-log-based-metrics.png
new file mode 100644
index 000000000..c74a52c08
Binary files /dev/null and b/best-practices/ml-platform/docs/images/create-log-based-metrics.png differ
diff --git a/best-practices/ml-platform/docs/images/log-analytics-data-processing.png b/best-practices/ml-platform/docs/images/log-analytics-data-processing.png
new file mode 100644
index 000000000..328b5b80d
Binary files /dev/null and b/best-practices/ml-platform/docs/images/log-analytics-data-processing.png differ
diff --git a/best-practices/ml-platform/docs/images/log-analytics.png b/best-practices/ml-platform/docs/images/log-analytics.png
new file mode 100644
index 000000000..38fe6b0c3
Binary files /dev/null and b/best-practices/ml-platform/docs/images/log-analytics.png differ
diff --git a/best-practices/ml-platform/docs/images/log-explorer-query.png b/best-practices/ml-platform/docs/images/log-explorer-query.png
new file mode 100644
index 000000000..0e3f3bbaf
Binary files /dev/null and b/best-practices/ml-platform/docs/images/log-explorer-query.png differ
diff --git a/best-practices/ml-platform/docs/images/mlflow-home.png b/best-practices/ml-platform/docs/images/mlflow-home.png
new file mode 100644
index 000000000..43998db1a
Binary files /dev/null and b/best-practices/ml-platform/docs/images/mlflow-home.png differ
diff --git a/best-practices/ml-platform/docs/images/mlflow-model-experiment.png b/best-practices/ml-platform/docs/images/mlflow-model-experiment.png
new file mode 100644
index 000000000..a320958c2
Binary files /dev/null and b/best-practices/ml-platform/docs/images/mlflow-model-experiment.png differ
diff --git a/best-practices/ml-platform/docs/images/monitor-job.png b/best-practices/ml-platform/docs/images/monitor-job.png
new file mode 100644
index 000000000..4c604e4fb
Binary files /dev/null and b/best-practices/ml-platform/docs/images/monitor-job.png differ
diff --git a/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture.svg b/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture.svg
new file mode 100644
index 000000000..971558fca
--- /dev/null
+++ b/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture.svg
@@ -0,0 +1,263 @@
+
+
diff --git a/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture_full.svg b/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture_full.svg
new file mode 100644
index 000000000..44368f1be
--- /dev/null
+++ b/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture_full.svg
@@ -0,0 +1,339 @@
+
+
diff --git a/best-practices/ml-platform/docs/images/use-case/MLFlow_experiment_tracking.png b/best-practices/ml-platform/docs/images/use-case/MLFlow_experiment_tracking.png
new file mode 100644
index 000000000..a0c19aeea
Binary files /dev/null and b/best-practices/ml-platform/docs/images/use-case/MLFlow_experiment_tracking.png differ
diff --git a/best-practices/ml-platform/docs/images/use-case/MLOps_e2e.png b/best-practices/ml-platform/docs/images/use-case/MLOps_e2e.png
new file mode 100644
index 000000000..fb5a573f7
Binary files /dev/null and b/best-practices/ml-platform/docs/images/use-case/MLOps_e2e.png differ
diff --git a/best-practices/ml-platform/docs/images/use-case/TensorBoard.png b/best-practices/ml-platform/docs/images/use-case/TensorBoard.png
new file mode 100644
index 000000000..b185a46be
Binary files /dev/null and b/best-practices/ml-platform/docs/images/use-case/TensorBoard.png differ
diff --git a/best-practices/ml-platform/docs/images/ray-dataprocessing-workflow.png b/best-practices/ml-platform/docs/images/use-case/data-processing-ray-workflow.png
similarity index 100%
rename from best-practices/ml-platform/docs/images/ray-dataprocessing-workflow.png
rename to best-practices/ml-platform/docs/images/use-case/data-processing-ray-workflow.png
diff --git a/best-practices/ml-platform/docs/images/use-case/dataset_info.png b/best-practices/ml-platform/docs/images/use-case/dataset_info.png
new file mode 100644
index 000000000..98b2ec611
Binary files /dev/null and b/best-practices/ml-platform/docs/images/use-case/dataset_info.png differ
diff --git a/best-practices/ml-platform/docs/images/use-case/mlflow_epoch_loss.png b/best-practices/ml-platform/docs/images/use-case/mlflow_epoch_loss.png
new file mode 100644
index 000000000..5f333afcb
Binary files /dev/null and b/best-practices/ml-platform/docs/images/use-case/mlflow_epoch_loss.png differ
diff --git a/best-practices/ml-platform/docs/images/use-log-based-metrics.png b/best-practices/ml-platform/docs/images/use-log-based-metrics.png
new file mode 100644
index 000000000..819497edc
Binary files /dev/null and b/best-practices/ml-platform/docs/images/use-log-based-metrics.png differ
diff --git a/best-practices/ml-platform/docs/platform/playground/architecture.md b/best-practices/ml-platform/docs/platform/playground/architecture.md
index d70fc9656..b2daf5715 100644
--- a/best-practices/ml-platform/docs/platform/playground/architecture.md
+++ b/best-practices/ml-platform/docs/platform/playground/architecture.md
@@ -1,5 +1,7 @@
# Playground Machine learning platform (MLP) on GKE: Architecture
+![Playground Architecture](/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture.svg)
+
## Platform
- [Google Cloud Project](https://console.cloud.google.com/cloud-resource-manager)
diff --git a/best-practices/ml-platform/docs/platform/products-and-features.md b/best-practices/ml-platform/docs/platform/products-and-features.md
index 27382a9bb..8615c5d68 100644
--- a/best-practices/ml-platform/docs/platform/products-and-features.md
+++ b/best-practices/ml-platform/docs/platform/products-and-features.md
@@ -2,7 +2,9 @@
This document outlines the products and features that are used in the platform.
-## Cloud Logging
+![Playground Architecture](/best-practices/ml-platform/docs/images/platform/playground/mlp_playground_architecture_full.svg)
+
+## Cloud Logging
Cloud Logging is a real-time log-management system with storage, search, analysis, and monitoring support. Cloud Logging automatically collects logs from Google Cloud resources. You can also collect logs from your applications, on-premise resources, and resources from other cloud providers. You can also configure alerting policies so that Cloud Monitoring notifies you if certain kinds of events are reported in your logs. For regulatory or security reasons, you can determine where your log data is stored.
@@ -89,11 +91,11 @@ GKE Dataplane V2 observability offers the following troubleshooting tools:
For more information see the [Dataplane V2 observability documentation](https://cloud.google.com/kubernetes-engine/docs/concepts/about-dpv2-observability)
-### Filestore CSI driver
+### Filestore CSI driver
The Filestore CSI driver is the primary way to use Filestore instances with GKE. The CSI driver provides a fully-managed experience powered by the open source Google Cloud Filestore CSI driver.
-The CSI driver version is tied to Kubernetes minor version numbers and is typically the latest driver available at the time that the Kubernetes minor version is released. The drivers update automatically when the cluster is upgraded to the latest GKE patch.
+The CSI driver version is tied to Kubernetes minor version numbers and is typically the latest driver available at the time that the Kubernetes minor version is released. The drivers update automatically when the cluster is upgraded to the latest GKE patch.
For more information see the [Filestore CSI driver](https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/filestore-csi-driver)
diff --git a/best-practices/ml-platform/docs/use-case/fine-tuning/end-2-end-narrative.md b/best-practices/ml-platform/docs/use-case/fine-tuning/end-2-end-narrative.md
new file mode 100644
index 000000000..b5c0bf565
--- /dev/null
+++ b/best-practices/ml-platform/docs/use-case/fine-tuning/end-2-end-narrative.md
@@ -0,0 +1,226 @@
+## Introduction
+
+This document presents a comprehensive guide to implementing the ML use case from start to finish. It explores the key components involved, outlines the step-by-step process flow, and highlights crucial considerations for successful implementation.
+
+## Use case
+
+The goal is to build a Virtual Retail Assistant \- a chatbot that is contextually sensitive to a retailer's product catalog, capable of responding to customer inquiries and suggesting alternatives and complementary outfits.
+
+An instruction-tuned base model can help with the conversation, but further fine-tuning is necessary to incorporate the retailer's product catalog data and to respond in the instructed format. Throughout this process, we will explore:
+
+* Data Set Preparation
+* Fine-Tuning of the Instruction Tuned Model
+* Hyperparameter Tuning to Generate Multiple Model Iterations
+* Model Validation, Evaluation, and Identification of the Optimal Model
+
+Within each topic, we will share methodologies, frameworks, tools, and lessons learned to support your journey.
+
+
+## Process Flow for Implementing the ML Use Case End to End
+
+
+![MLOps workflow](/best-practices/ml-platform/docs/images/use-case/MLOps_e2e.png)
+
+
+## Data Preprocessing
+
+The data preprocessing phase in MLOps is foundational. It directly impacts the quality and performance of machine learning models. Preprocessing includes tasks such as data cleaning, feature engineering, scaling, and encoding, all of which are essential for ensuring that models learn effectively from the data.
+
+### Dataset
+
+We are leveraging a [pre-crawled public dataset](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products), a 20,000-product subset of a larger dataset (more than 5.8 million products) created by extracting data from [Flipkart](https://www.flipkart.com/), a leading Indian eCommerce store.
+
+![Dataset Snapshot](/best-practices/ml-platform/docs/images/use-case/dataset_info.png)
+
+### Data Preprocessing Steps
+
+The dataset has 15 columns. The columns of interest are: 'uniq\_id', 'product\_name', 'description', 'brand', 'product\_category\_tree', 'image', 'product\_specifications'.
+
+In addition to dropping null values and duplicates, we will perform the following steps (a minimal pandas sketch follows the list):
+
+1. description: Clean up the product description by removing stop words and punctuation.
+2. product\_category\_tree: Split the category tree into separate columns.
+3. product\_specifications: Parse the product specifications into Key:Value pairs.
+4. image: Parse the list of image URLs, validate each URL, and download the images.
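+
+A minimal pandas sketch of the null/duplicate handling and of steps 2 and 4; the CSV file name and the `["A >> B >> C"]` category format are assumptions here, and the URL validation/download is omitted:
+
+```py
+import pandas as pd
+
+# Assumed file name for the Kaggle dataset; adjust to your copy.
+df = pd.read_csv("flipkart_com-ecommerce_sample.csv")
+df = df.dropna(subset=["description", "product_category_tree"]).drop_duplicates(subset=["uniq_id"])
+
+# 2. product_category_tree: split the top category levels into separate columns.
+levels = df["product_category_tree"].str.strip('[]"').str.split(" >> ", expand=True)
+df["category_0"], df["category_1"] = levels[0], levels[1]
+
+# 4. image: parse the JSON-style list of image URLs into a Python list.
+df["image_urls"] = (
+    df["image"].fillna("[]").apply(lambda s: [u.strip(' "') for u in s.strip("[]").split(",") if u.strip()])
+)
+```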
+
+Now, consider the scenario where a preprocessing task involves extracting multiple image URLs from each row of a large dataset and uploading the images to a Google Cloud Storage (GCS) bucket. This might sound straightforward, but with a dataset containing 20,000+ rows, each with potentially up to seven URLs, the process can become incredibly time-consuming when executed serially in Python. In our experience, such a task can take upwards of eight hours to complete.
+
+### Parallelism: The Solution to Scalability
+
+To tackle this scalability issue, we turn to parallelism. By breaking the dataset into smaller chunks and distributing the processing across multiple threads or processes, we can drastically reduce the overall execution time.
+
+For implementation steps, see [Distributed Data Preprocessing with Ray](/best-practices/ml-platform/examples/use-case/data-processing/ray/README.md).
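+
+As an illustration, a minimal Ray sketch of the chunk-and-distribute pattern; the chunk size and per-row work are placeholders, and the image download and GCS upload are omitted:
+
+```py
+import ray
+
+ray.init()  # connect to the Ray cluster, or start a local instance
+
+@ray.remote
+def process_chunk(chunk):
+    # For each row: download its image URLs and upload them to a GCS bucket (omitted here).
+    return len(chunk)
+
+rows = [{"uniq_id": str(i), "image_urls": []} for i in range(20000)]  # placeholder rows
+chunks = [rows[i:i + 500] for i in range(0, len(rows), 500)]
+processed = sum(ray.get([process_chunk.remote(c) for c in chunks]))
+print(f"processed {processed} rows")
+```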
+
+
+## Data Preparation
+
+For our use case, we are fine-tuning the Gemma2 9B Instruction Tuned model to align it with our specific requirements. The instruction-tuned nature of this model mandates that input data adhere to a natural language conversational format. However, our preprocessed product catalog is currently structured in a tabular format, necessitating a data preparation phase to construct suitable prompts.
+
+These prompts will consist of a range of potential questions that an online shopper might pose, alongside corresponding answers derived from the product catalog. The answers will be presented in a contextually relevant manner, simulating the interaction between a shopper and a knowledgeable assistant within an e-commerce environment.
+
+
+
+**Methodology:**
+
+The dataset underwent a filtering process to concentrate exclusively on clothing items, catering specifically to Women, Men, and Kids. Categories containing fewer than ten products were excluded.
+
+To optimize instruction-tuned models, it is imperative that each data point consists of three distinct components: Question, Answer, and Context.
+
+The Gemini Flash model (through Vertex AI) was used to generate natural, shopping-style conversational questions (inputs) and answers (outputs). These inputs and outputs were formulated from the product specifications detailed within the catalog.
+
+The prompts were created utilizing the previously generated questions and answers. The product category is used to set the context of the conversation.
+
+**Prompt Sample:**
+
+```
+user
+ Context:Online shopping for Men's Clothing
+What kind of material is the SKOOKIE Sleeveless Solid Men's Jacket made of?model
+The SKOOKIE Sleeveless Solid Men's Jacket is made of a blinded fabric.
+ Product Name: SKOOKIE Sleeveless Solid Men's Jacket
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Sleeveless
+- Reversible: No
+- Fabric: Blinded
+- Pattern: Solid
+- Ideal For: Men's
+- Style Code: SKMJ-3005-C-Blue
+
+```
+
+The ‘End Of Sequence’ token was appended to each prompt.
+
+EOS\_TOKEN \= '\'
+
+For implementation steps, see [Data preparation for fine tuning Gemma IT model](/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/README.md).
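+
+For illustration only, a hedged sketch of generating a question/answer pair with Gemini Flash through the Vertex AI SDK; the project, region, model id, and prompt wording are assumptions, not the exact values used:
+
+```py
+import vertexai
+from vertexai.generative_models import GenerativeModel
+
+vertexai.init(project="my-project", location="us-central1")  # hypothetical project and region
+model = GenerativeModel("gemini-1.5-flash-001")  # assumed Gemini Flash model id
+
+product = "SKOOKIE Sleeveless Solid Men's Jacket | Fabric: Blinded | Pattern: Solid | Ideal For: Men's"
+prompt = (
+    "Generate one natural shopping question a customer might ask about this product, "
+    f"then answer it using only these specifications:\n{product}"
+)
+response = model.generate_content(prompt)
+print(response.text)
+```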
+
+
+## Fine-tuning
+
+With our prepared dataset, we proceed to fine-tune the Gemma model using Low-Rank Adaptation (LoRA) via the Parameter-Efficient Fine-Tuning (PEFT) framework in PyTorch. This approach incorporates Transformer Reinforcement Learning's (TRL) Supervised Fine-Tuning (SFT) methodology. The SFT process is initiated by providing the *`SFTTrainer`* with both the dataset and the specific tuning parameter configurations.
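+
+A condensed sketch of that setup, assuming a `trl` release whose `SFTTrainer` accepts `TrainingArguments` directly (newer releases use `SFTConfig`); the dataset file, its `text` field, and the argument values are illustrative, not the exact ones used:
+
+```py
+import torch
+from datasets import load_dataset
+from peft import LoraConfig
+from transformers import AutoModelForCausalLM, TrainingArguments
+from trl import SFTTrainer
+
+model_id = "google/gemma-2-9b-it"
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
+
+peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")
+
+# Hypothetical JSONL with one prepared prompt per line in a "text" field.
+train_dataset = load_dataset("json", data_files="prepared_prompts.jsonl", split="train")
+
+trainer = SFTTrainer(
+    model=model,
+    train_dataset=train_dataset,
+    peft_config=peft_config,
+    args=TrainingArguments(
+        output_dir="/data/model",  # e.g. a mounted GCS bucket
+        num_train_epochs=1,
+        learning_rate=2e-4,
+        per_device_train_batch_size=1,
+    ),
+)
+trainer.train()
+```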
+
+While single-GPU setups without quantization might be sufficient, scaling to multiple GPUs across multiple hosts necessitates sharding and distributed workload management. We employ Fully Sharded Data Parallelism ([FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/)) for model sharding and leverage [Accelerate](https://huggingface.co/docs/accelerate/en/index) for efficient distributed training.
+
+### **Initial Fine-Tuning and Optimization**
+
+The initial fine-tuning run utilized a baseline set of parameters. Accelerate aggregates data across all shards and saves the model to a mounted Google Cloud Storage (GCS) bucket for convenient access and persistence.
+
+Having established a working baseline, our focus shifts to optimization. Our goal is to minimize loss while ensuring the model achieves a satisfactory level of accuracy. We'll systematically adjust hyperparameters such as the learning rate and the number of training epochs to identify the optimal configuration for our specific task and dataset.
+
+## Hyperparameter Tuning to Generate Multiple Model Iterations
+
+Fine-tuning large language models like Gemma involves carefully adjusting various hyperparameters to optimize performance and tailor the model's behavior for specific tasks. These hyperparameters control aspects of the model's architecture, training process, and optimization algorithm.
+
+In our fine-tuning process, we focused on the following key parameters:
+
+### **Core LoRA Parameters**
+
+* **lora\_r (Rank):** This parameter controls the rank of the low-rank matrices used in LoRA. A higher rank increases the expressiveness of the model but also increases the number of trainable parameters.
+* **lora\_alpha (Alpha):** This scaling factor balances the contribution of the LoRA updates to the original model weights. A larger alpha can lead to faster adaptation but may also introduce instability.
+* **lora\_dropout (Dropout):** Dropout is a regularization technique that helps prevent overfitting. Applying dropout to LoRA layers can further improve generalization.
+
+### **Learning Rate, Optimization, and Training Duration**
+
+* **learning\_rate (Learning Rate):** The learning rate determines the step size the optimizer takes during training. Finding the optimal learning rate is crucial for efficient convergence and avoiding overshooting.
+* **epochs (Number of Epochs):** The number of epochs dictates how many times the model sees the entire training dataset. More epochs generally lead to better performance, but overfitting can become a concern.
+* **max\_grad\_norm (Maximum Gradient Norm):** Gradient clipping helps prevent exploding gradients, which can destabilize training. This parameter limits the maximum norm of the gradients.
+* **weight\_decay (Weight Decay):** Weight decay is a regularization technique that adds a penalty to the loss function based on the magnitude of the model weights. This helps prevent overfitting.
+* **warmup\_ratio (Warmup Ratio):** This parameter determines the proportion of the total training steps during which the learning rate is gradually increased from zero to its initial value. Warmup can improve stability and convergence.
+
+### **Sequence Length Considerations**
+
+* **max\_seq\_length (Maximum Sequence Length):** This parameter controls the maximum length of the input sequences the model can process. Longer sequences can provide more context but also require more computational resources.
+
+### Experimental Values
+
+The following table displays the experimental values across multiple jobs we used for the aforementioned parameters:
+
+| parameter | base | job-0 | job-1 | job-2 |
+| :---- | ----- | ----- | ----- | ----- |
+| epochs | 1 | 2 | 3 | 4 |
+| lora\_r | 8 | 8 | 16 | 32 |
+| lora\_alpha | 16 | 16 | 32 | 64 |
+| lora\_dropout | 0.1 | 0.1 | 0.2 | 0.3 |
+| max\_grad\_norm | 0.3 | 1 | 1 | 1 |
+| learning\_rate | 2.00E+04 | 2.00E+05 | 2.00E+04 | 3.00E+04 |
+| weight\_decay | 0.001 | 0.01 | 0.005 | 0.001 |
+| warmup\_ratio | 0.03 | 0.1 | 0.2 | 0.3 |
+| max\_seq\_length | 512 | 1024 | 2048 | 8192 |
+
+### Monitoring Hyperparameter Tuning Jobs
+
+Tracking multiple hyperparameter tuning jobs with varying parameters and metrics requires a mechanism to track input parameters, monitor the training progress and share the results. Custom tools can be developed, but there are also open-source options like [MLflow](https://www.mlflow.org/).
+
+MLflow is an open-source platform for tracking experiments, including hyperparameter tuning. It provides a central repository for storing and organizing experimental data, making it easy to compare and analyze different configurations and results. MLflow was deployed onto the platform cluster to track our experiments.
+
+**Incorporating MLflow**
+
+Incorporating MLflow into your code is straightforward. Import the MLflow library and set the tracking server URI. Enable system metrics logging by setting the MLFLOW\_ENABLE\_SYSTEM\_METRICS\_LOGGING environment variable and installing the psutil and pynvml libraries.
+
+```py
+import mlflow
+
+mlflow.set_tracking_uri(remote_server_uri)
+mlflow.autolog()
+```
+
+This will allow MLflow to track fine-tuning parameters, results, and system-level metrics like CPU, memory, and GPU utilization. This added layer of monitoring provides a comprehensive view of batch fine-tuning jobs, making it easier to compare different configurations and results.
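+
+A short sketch of enabling the system-metrics logging mentioned above, assuming MLflow 2.8+ with `psutil` and `pynvml` installed:
+
+```py
+import os
+
+# Either export the variable before the job starts ...
+os.environ["MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"] = "true"
+
+# ... or enable it programmatically.
+import mlflow
+
+mlflow.enable_system_metrics_logging()
+```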
+
+![epoch_vs_loss](/best-practices/ml-platform/docs/images/use-case/mlflow_epoch_loss.png)
+
+![ml_flow_](/best-practices/ml-platform/docs/images/use-case/MLFlow_experiment_tracking.png)
+
+
+
+Alternative solutions, such as Weights & Biases, offer additional capabilities. While some of these platforms provide comprehensive pipeline features, our immediate requirements are satisfied by MLflow's core tracking functionality.
+
+Integrating MLflow was relatively simple, involving the deployment of a tracking server and a local SQLite database. While the file-based approach is swift and straightforward, it becomes inefficient at scale. In a production-grade deployment, organizations may opt for a managed Postgres database (e.g., Cloud SQL, AlloyDB) to host MLflow data.
+
+## Model evaluation and validation
+
+After each fine-tuning cycle, the model must be validated for precision and accuracy. This assessment involves a two-stage process:
+
+### Prediction Stage
+
+* The fine-tuned model, loaded into vLLM for efficient serving, is presented with prompts derived from the test dataset (see the sketch after this list).
+* Model outputs (predictions) are captured and stored for subsequent analysis.
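+
+A hedged sketch of this step with vLLM; the model path, prompt, and sampling settings are illustrative:
+
+```py
+from vllm import LLM, SamplingParams
+
+llm = LLM(model="/data/model")  # hypothetical path to the fine-tuned model
+sampling = SamplingParams(temperature=0.0, max_tokens=256)
+
+prompts = [
+    "<start_of_turn>user\nContext:Online shopping for Men's Clothing\n"
+    "What kind of material is the SKOOKIE Sleeveless Solid Men's Jacket made of?<end_of_turn>\n"
+    "<start_of_turn>model\n"
+]
+outputs = llm.generate(prompts, sampling)
+predictions = [output.outputs[0].text for output in outputs]
+```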
+
+### Accuracy Calculation
+
+Accuracy is calculated on the predictions by comparing the model's responses with the product data from the training dataset.
+
+This task outputs the following key metrics (a short sketch of the arithmetic follows the list):
+
+* **Training Data Set Size:** The total number of samples in the training dataset.
+* **Data Match:** The count of predictions that exactly align with the ground truth.
+* **False Positives:** The count of predictions where the model incorrectly recommended a product.
+* **No Product Details:** The count of instances where the model failed to extract product details when they were present in the training dataset.
+* **Calculated Accuracy:** The overall proportion of correct predictions made by the model.
+* **Calculated Precision:** The proportion of positive predictions made by the model that were actually correct.
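+
+A minimal sketch of the arithmetic behind these fields, checked against the first row of the results table below (predictions with no product details are excluded from the precision denominator):
+
+```py
+def summarize(total, matches, no_product_details):
+    accuracy = matches / total                          # Accuracy (match/total)
+    precision = matches / (total - no_product_details)  # Precision excludes "no product details" rows
+    return accuracy, precision
+
+# First Gemma 1.1 7b-it row: 976 test samples, 815 exact matches, 1 with no product details.
+accuracy, precision = summarize(total=976, matches=815, no_product_details=1)
+print(f"accuracy={accuracy:.2%} precision={precision:.2%}")  # accuracy=83.50% precision=83.59%
+```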
+
+### Our Results
+
+| Model | Job | Training Data Size | Test Data Size | True Positives | False Positives | No Product Details | Accuracy (match/total) | Precision (match/(total-no details)) |
+| :---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
+| Gemma 1.1 7b-it | 1 | 3902 | 976 | 815 | 160 | 1 | 83.50% | 83.59% |
+| Gemma 1.1 7b-it | 2 | 3902 | 976 | 737 | 225 | 14 | 75.51% | 76.61% |
+| Gemma 2 9b-it | 1 | 3902 | 976 | 836 | 137 | 4 | 85.66% | 86.01% |
+| Gemma 2 9b-it | 2 | 3902 | 976 | 812 | 164 | 0 | 83.20% | 83.20% |
+| Gemma 2 9b-it | 1 | 12838 | 3210 | 2838 | 372 | 0 | 88.41% | 88.41% |
+| Gemma 2 9b-it | 2 | 12838 | 3210 | 2482 | 719 | 9 | 77.32% | 77.54% |
+
+### Interpreting Results and Decision-Making
+
+The evaluation above provides a granular understanding of the model's performance. By examining the accuracy, precision, and the distribution of errors (false positives, no product details), the team can make informed decisions about the model's suitability for deployment:
+
+* **Acceptable Accuracy Threshold:** Determine a minimum accuracy level required for the model to be considered effective.
+* **Error Analysis:** Investigate the types of errors the model makes to identify potential areas for improvement in future fine-tuning iterations.
+* **Deployment Decision:** If the model meets or exceeds the predetermined criteria, it can be confidently deployed for real-world use. If not, further refinement or adjustments may be necessary.
+
+**Additional Considerations**
+
+* Ensure the test dataset accurately reflects the real-world data distribution the model will encounter in production.
+* Consider the size of the test dataset to assess the statistical significance of the results.
+
+For implementation steps, see [Fine tuning Gemma IT model](/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/README.md).
diff --git a/best-practices/ml-platform/examples/platform/playground/README.md b/best-practices/ml-platform/examples/platform/playground/README.md
index d40b57da4..494f6cf03 100644
--- a/best-practices/ml-platform/examples/platform/playground/README.md
+++ b/best-practices/ml-platform/examples/platform/playground/README.md
@@ -49,13 +49,13 @@ The default quota given to a project should be sufficient for this guide.
```
export MLP_BASE_DIR=$(pwd) && \
- echo "export MLP_BASE_DIR=${MLP_BASE_DIR}" >> ${HOME}/.bashrc
+ sed -n -i -e '/^export MLP_BASE_DIR=/!p' -i -e '$aexport MLP_BASE_DIR="'"${MLP_BASE_DIR}"'"' ${HOME}/.bashrc
```
```
cd examples/platform/playground && \
export MLP_TYPE_BASE_DIR=$(pwd) && \
- echo "export MLP_TYPE_BASE_DIR=${MLP_TYPE_BASE_DIR}" >> ${HOME}/.bashrc
+ sed -n -i -e '/^export MLP_TYPE_BASE_DIR=/!p' -i -e '$aexport MLP_TYPE_BASE_DIR="'"${MLP_TYPE_BASE_DIR}"'"' ${HOME}/.bashrc
```
## GitHub Configuration
@@ -115,7 +115,6 @@ The default quota given to a project should be sufficient for this guide.
export MLP_GIT_NAMESPACE=""
export MLP_GIT_USER_EMAIL=""
export MLP_GIT_USER_NAME=""
-
```
- Set the configuration variables
@@ -140,7 +139,7 @@ You only need to complete the section for the option that you have selected (eit
```
export MLP_PROJECT_ID=""
- export MLP_STATE_BUCKET="${MLP_PROJECT_ID}-tf-state"
+ export MLP_STATE_BUCKET="${MLP_PROJECT_ID}-terraform"
```
- Set the default `gcloud` project
@@ -283,12 +282,13 @@ Before running Terraform, make sure that the Service Usage API is enable.
`gcloud services enable serviceusage.googleapis.com`
-- Ensure the endpoint is not in a deleted state
+- Ensure the endpoints are not in a deleted state
```
MLP_ENVIRONMENT_NAME=$(grep environment_name ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
MLP_PROJECT_ID=$(grep environment_project_id ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
gcloud endpoints services undelete ray-dashboard.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog --quiet 2>/dev/null
+ gcloud endpoints services undelete mlflow-tracking.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog --quiet 2>/dev/null
```
- Create the resources
@@ -303,6 +303,15 @@ Before running Terraform, make sure that the Service Usage API is enable.
See [Create resources errors](#create-resources-errors) in the Troubleshooting section if the apply does not complete successfully.
+- Create your environment configuration file
+
+ ```
+ export MLP_ENVIRONMENT_FILE="${HOME}/mlp-${MLP_PROJECT_ID}-${MLP_ENVIRONMENT_NAME}.env" && \
+ sed -n -i -e '/^export MLP_ENVIRONMENT_FILE=/!p' -i -e '$aexport MLP_ENVIRONMENT_FILE="'"${MLP_ENVIRONMENT_FILE}"'"' ${HOME}/.bashrc && \
+ terraform output -raw environment_configuration > ${MLP_ENVIRONMENT_FILE} && \
+ source ${MLP_ENVIRONMENT_FILE}
+ ```
+
## Review the resources
### GKE clusters and ConfigSync
@@ -315,16 +324,14 @@ Before running Terraform, make sure that the Service Usage API is enable.
### Software installed via RepoSync and RootSync
-Open Cloud Shell to execute the following commands:
-
-- Store your GKE cluster name in env variable:
+For the playground configuration, [Ray](https://docs.ray.io/en/latest/cluster/kubernetes/index.html) and [MLflow](https://mlflow.org/) are installed by default.
- `export GKE_CLUSTER=`
+You can check the installation by executing the following commands in Cloud Shell:
- Get cluster credentials:
```
- gcloud container fleet memberships get-credentials ${GKE_CLUSTER}
+ gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME}
```
The output will be similar to the following:
@@ -362,28 +369,29 @@ Open Cloud Shell to execute the following commands:
kuberay-operator-56b8d98766-2nvht 1/1 Running 0 6m26s
```
-- Check the namespace `ml-team` created:
+- Check the namespace created:
```
- kubectl get ns | grep ml-team
+ kubectl get ns ${MLP_KUBERNETES_NAMESPACE}
```
The output will be similar to the following:
```
- ml-team Active 21m
+ NAME STATUS AGE
+ ml-team Active ##m
```
-- Check the RepoSync object created `ml-team` namespace:
+- Check the RepoSync object created for the namespace:
```
- kubectl get reposync -n ml-team
+ kubectl get reposync -n ${MLP_KUBERNETES_NAMESPACE}
```
-- Check the `raycluster` in `ml-team` namespace
+- Check the `raycluster` in the namespace
```
- kubectl get raycluster -n ml-team
+ kubectl get raycluster -n ${MLP_KUBERNETES_NAMESPACE}
```
The output will be similar to the following:
@@ -393,10 +401,10 @@ Open Cloud Shell to execute the following commands:
ray-cluster-kuberay 1 1 ready 29m
```
-- Check the head and worker pods of kuberay in `ml-team` namespace
+- Check the head and worker pods of kuberay in the namespace
```
- kubectl get pods -n ml-team
+ kubectl get pods -n ${MLP_KUBERNETES_NAMESPACE}
```
The output will be similar to the following:
@@ -405,19 +413,24 @@ Open Cloud Shell to execute the following commands:
NAME READY STATUS RESTARTS AGE
ray-cluster-kuberay-head-sp6dg 2/2 Running 0 3m21s
ray-cluster-kuberay-worker-workergroup-rzpjw 2/2 Running 0 3m21s
+ mlflow-tracking-6f9bb844f9-4749n 2/2 Running 0 3m13s
```
-- Open the `ml-team` Ray dashboard
+- Open the namespace's Ray dashboard
```
- MLP_ENVIRONMENT_NAME=$(grep environment_name ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
- MLP_PROJECT_ID=$(grep environment_project_id ${MLP_TYPE_BASE_DIR}/mlp.auto.tfvars | awk -F"=" '{print $2}' | xargs)
- echo -e "\nml-team Ray dashboard: https://ray-dashboard.ml-team.mlp-${MLP_ENVIRONMENT_NAME}.endpoints.${MLP_PROJECT_ID}.cloud.goog\n"
+ echo -e "\n${MLP_KUBERNETES_NAMESPACE} Ray dashboard: ${MLP_RAY_DASHBOARD_NAMESPACE_ENDPOINT}\n"
+ ```
+
+- Open the namespace's MLFlow Tracking server
+
+ ```
+ echo -e "\n${MLP_KUBERNETES_NAMESPACE} MLFlow Tracking URL: ${MLP_MLFLOW_TRACKING_NAMESPACE_ENDPOINT}\n"
```
- > If you get `ERR_CONNECTION_CLOSED` or `ERR_CONNECTION_RESET` when trying to go to the Ray dashboard, the [Gateway](https://console.cloud.google.com/kubernetes/gateways) is still being provisioned. Retry in a couple of minutes.
+ > If you get `ERR_CONNECTION_CLOSED` or `ERR_CONNECTION_RESET` when trying to go to the links, the [Gateway](https://console.cloud.google.com/kubernetes/gateways) is still being provisioned. Retry in a couple of minutes.
- > If you get `ERR_SSL_VERSION_OR_CIPHER_MISMATCH` when trying to go to the Ray dashboard, the [SSL certificate](https://console.cloud.google.com/security/ccm/list/lbCertificates) is still being provisioned. Retry in a couple of minutes.
+ > If you get `ERR_SSL_VERSION_OR_CIPHER_MISMATCH` when trying to go to the links, the [SSL certificate](https://console.cloud.google.com/security/ccm/list/lbCertificates) is still being provisioned. Retry in a couple of minutes.
## Cleanup
@@ -458,7 +471,21 @@ You only need to complete the section for the option that you have selected.
gsutil -m rm -rf gs://${TERRAFORM_BUCKET_NAME}/* && \
terraform init && \
terraform destroy -auto-approve && \
- rm -rf .terraform .terraform.lock.hcl
+ rm -rf .terraform .terraform.lock.hcl state/
+ ```
+
+### Environment configuration
+
+- Delete the environment configuration file
+
+ ```
+ rm -f ${MLP_ENVIRONMENT_FILE}
+ ```
+
+- Remove the environment configuration environment variable
+
+ ```
+ sed -i -e '/^export MLP_ENVIRONMENT_FILE=/d' ${HOME}/.bashrc
```
### Code Repository
@@ -488,6 +515,15 @@ You only need to complete the section for the option that you have selected.
terraform/features/initialize/state
```
+- Remove the environment variables
+
+ ```
+ sed \
+ -i -e '/^export MLP_BASE_DIR=/d' \
+ -i -e '/^export MLP_TYPE_BASE_DIR=/d' \
+ ${HOME}/.bashrc
+ ```
+
## Troubleshooting
### Create resources errors
diff --git a/best-practices/ml-platform/examples/platform/playground/container_cluster.tf b/best-practices/ml-platform/examples/platform/playground/container_cluster.tf
index cf56b98b2..b3b716443 100644
--- a/best-practices/ml-platform/examples/platform/playground/container_cluster.tf
+++ b/best-practices/ml-platform/examples/platform/playground/container_cluster.tf
@@ -13,6 +13,7 @@
# limitations under the License.
locals {
+ cluster_name = "${var.cluster_name_prefix}-${var.environment_name}"
# Minimal roles for nodepool SA https://cloud.google.com/kubernetes-engine/docs/how-to/hardening-your-cluster#use_least_privilege_sa
cluster_sa_roles = [
"roles/monitoring.viewer",
@@ -28,9 +29,9 @@ locals {
# Create dedicated service account for node pools
resource "google_service_account" "cluster" {
project = data.google_project.environment.project_id
- account_id = "vm-${var.cluster_name}-${var.environment_name}"
- display_name = "${var.cluster_name}-${var.environment_name} Service Account"
- description = "Terraform-managed service account for cluster ${var.cluster_name}-${var.environment_name}"
+ account_id = "vm-${local.cluster_name}"
+ display_name = "${local.cluster_name} Service Account"
+ description = "Terraform-managed service account for cluster ${local.cluster_name}"
}
# Bind minimum role list + additional roles to nodepool SA on project
@@ -48,7 +49,7 @@ resource "google_container_cluster" "mlp" {
deletion_protection = false
enable_shielded_nodes = true
location = var.subnet_01_region
- name = "${var.cluster_name}-${var.environment_name}"
+ name = local.cluster_name
network = module.create-vpc.vpc
project = data.google_project.environment.project_id
remove_default_node_pool = false
@@ -186,10 +187,13 @@ resource "google_container_cluster" "mlp" {
enable_components = [
"APISERVER",
+ "CADVISOR",
"CONTROLLER_MANAGER",
"DAEMONSET",
+ "DCGM",
"DEPLOYMENT",
"HPA",
+ "KUBELET",
"POD",
"SCHEDULER",
"STATEFULSET",
diff --git a/best-practices/ml-platform/examples/platform/playground/container_node_pool.tf b/best-practices/ml-platform/examples/platform/playground/container_node_pool.tf
index 2abe2e1b8..28a637b0b 100644
--- a/best-practices/ml-platform/examples/platform/playground/container_node_pool.tf
+++ b/best-practices/ml-platform/examples/platform/playground/container_node_pool.tf
@@ -16,7 +16,7 @@
# Available zones: https://cloud.google.com/compute/docs/regions-zones#available
###############################################################################
resource "google_container_node_pool" "cpu_n4s8" {
- depends_on = [google_container_cluster.mlp]
+ depends_on = [google_gke_hub_membership.cluster]
# Variables
cluster = google_container_cluster.mlp.name
@@ -89,7 +89,7 @@ resource "google_container_node_pool" "cpu_n4s8" {
# Available zones: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones#view-using-table
###############################################################################
resource "google_container_node_pool" "gpu_a100x2_a2h2" {
- depends_on = [google_container_cluster.mlp]
+ depends_on = [google_gke_hub_membership.cluster]
# Variables
cluster = google_container_cluster.mlp.name
@@ -176,7 +176,7 @@ resource "google_container_node_pool" "gpu_a100x2_a2h2" {
###############################################################################
resource "google_container_node_pool" "gpu_a100x2_a2h2_dws" {
- depends_on = [google_container_cluster.mlp]
+ depends_on = [google_gke_hub_membership.cluster]
# Variables
cluster = google_container_cluster.mlp.name
@@ -267,7 +267,7 @@ resource "google_container_node_pool" "gpu_a100x2_a2h2_dws" {
###############################################################################
resource "google_container_node_pool" "gpu_h100x8_a3h8_dws" {
- depends_on = [google_container_cluster.mlp]
+ depends_on = [google_gke_hub_membership.cluster]
# Variables
cluster = google_container_cluster.mlp.name
@@ -360,7 +360,7 @@ resource "google_container_node_pool" "gpu_h100x8_a3h8_dws" {
###############################################################################
resource "google_container_node_pool" "gpu_l4x2_g2s24" {
- depends_on = [google_container_cluster.mlp]
+ depends_on = [google_gke_hub_membership.cluster]
# Variables
cluster = google_container_cluster.mlp.name
@@ -445,7 +445,7 @@ resource "google_container_node_pool" "gpu_l4x2_g2s24" {
###############################################################################
resource "google_container_node_pool" "gpu_l4x2_g2s24_dws" {
- depends_on = [google_container_cluster.mlp]
+ depends_on = [google_gke_hub_membership.cluster]
# Variables
cluster = google_container_cluster.mlp.name
@@ -534,7 +534,7 @@ resource "google_container_node_pool" "gpu_l4x2_g2s24_dws" {
###############################################################################
resource "google_container_node_pool" "gpu_l4x2_g2s24_spot" {
- depends_on = [google_container_cluster.mlp]
+ depends_on = [google_gke_hub_membership.cluster]
# Variables
cluster = google_container_cluster.mlp.name
diff --git a/best-practices/ml-platform/examples/platform/playground/fleet.tf b/best-practices/ml-platform/examples/platform/playground/fleet.tf
index c60fb36e8..8b5678a6c 100644
--- a/best-practices/ml-platform/examples/platform/playground/fleet.tf
+++ b/best-practices/ml-platform/examples/platform/playground/fleet.tf
@@ -47,8 +47,9 @@ resource "null_resource" "gke_hub_feature_configmanagement" {
resource "google_gke_hub_membership" "cluster" {
depends_on = [
+ google_gke_hub_membership.cluster,
google_project_service.gkeconnect_googleapis_com,
- google_project_service.gkehub_googleapis_com
+ google_project_service.gkehub_googleapis_com,
]
membership_id = google_container_cluster.mlp.name
diff --git a/best-practices/ml-platform/examples/platform/playground/gateway.tf b/best-practices/ml-platform/examples/platform/playground/gateway.tf
index 1798cec2b..f50657fbb 100644
--- a/best-practices/ml-platform/examples/platform/playground/gateway.tf
+++ b/best-practices/ml-platform/examples/platform/playground/gateway.tf
@@ -13,14 +13,17 @@
# limitations under the License.
locals {
- hostname_suffix = "endpoints.${data.google_project.environment.project_id}.cloud.goog"
- gateway_manifests_directory = "${path.module}/manifests/${var.environment_name}/${var.namespace}/gateway"
- gateway_name = "external-https"
- iap_domain = var.iap_domain != null ? var.iap_domain : split("@", trimspace(data.google_client_openid_userinfo.identity.email))[1]
- iap_oath_brand = "projects/${data.google_project.environment.number}/brands/${data.google_project.environment.number}"
- ray_head_service_name = "ray-cluster-kuberay-head-svc"
- ray_dashboard_endpoint = "ray-dashboard.${data.kubernetes_namespace_v1.team.metadata[0].name}.mlp-${var.environment_name}.${local.hostname_suffix}"
- ray_dashboard_port = 8265
+ hostname_suffix = "endpoints.${data.google_project.environment.project_id}.cloud.goog"
+ gateway_manifests_directory = "${path.module}/manifests/${var.environment_name}/${var.namespace}/gateway"
+ gateway_name = "external-https"
+ iap_domain = var.iap_domain != null ? var.iap_domain : split("@", trimspace(data.google_client_openid_userinfo.identity.email))[1]
+ iap_oath_brand = "projects/${data.google_project.environment.number}/brands/${data.google_project.environment.number}"
+ ray_head_service_name = "ray-cluster-kuberay-head-svc"
+ ray_dashboard_endpoint = "ray-dashboard.${data.kubernetes_namespace_v1.team.metadata[0].name}.mlp-${var.environment_name}.${local.hostname_suffix}"
+ ray_dashboard_port = 8265
+ mlflow_tracking_endpoint = "mlflow-tracking.${data.kubernetes_namespace_v1.team.metadata[0].name}.mlp-${var.environment_name}.${local.hostname_suffix}"
+ mlflow_tracking_service_name = "mlflow-tracking-svc"
+ mlflow_tracking_port = 5000
}
###############################################################################
@@ -42,7 +45,7 @@ resource "google_compute_managed_ssl_certificate" "external_gateway" {
project = data.google_project.environment.project_id
managed {
- domains = [local.ray_dashboard_endpoint]
+ domains = [local.ray_dashboard_endpoint, local.mlflow_tracking_endpoint]
}
}
@@ -67,6 +70,18 @@ resource "google_endpoints_service" "ray_dashboard_https" {
service_name = local.ray_dashboard_endpoint
}
+resource "google_endpoints_service" "mlflow_tracking_https" {
+ openapi_config = templatefile(
+ "${path.module}/templates/openapi/endpoint.tftpl.yaml",
+ {
+ endpoint = local.mlflow_tracking_endpoint,
+ ip_address = google_compute_global_address.external_gateway_https.address
+ }
+ )
+ project = data.google_project.environment.project_id
+ service_name = local.mlflow_tracking_endpoint
+}
+
resource "local_file" "gateway_external_https_yaml" {
content = templatefile(
"${path.module}/templates/gateway/gateway-external-https.tftpl.yaml",
@@ -93,6 +108,20 @@ resource "local_file" "route_ray_dashboard_https_yaml" {
filename = "${local.gateway_manifests_directory}/route-ray-dashboard-https.yaml"
}
+resource "local_file" "route_mlflow_tracking_https_yaml" {
+ content = templatefile(
+ "${path.module}/templates/gateway/http-route-service.tftpl.yaml",
+ {
+ gateway_name = local.gateway_name,
+ http_route_name = "mlflow-tracking-https",
+ hostname = local.mlflow_tracking_endpoint
+ service_name = local.mlflow_tracking_service_name
+ service_port = local.mlflow_tracking_port
+ }
+ )
+ filename = "${local.gateway_manifests_directory}/route-mlflow-tracking-https.yaml"
+}
+
###############################################################################
# IAP
###############################################################################
diff --git a/best-practices/ml-platform/examples/platform/playground/gitops_configsync.tf b/best-practices/ml-platform/examples/platform/playground/gitops_configsync.tf
index 5414f4b1f..25afa5f77 100644
--- a/best-practices/ml-platform/examples/platform/playground/gitops_configsync.tf
+++ b/best-practices/ml-platform/examples/platform/playground/gitops_configsync.tf
@@ -16,6 +16,7 @@ locals {
namespace_default_kubernetes_service_account = "default"
ray_head_kubernetes_service_account = "ray-head"
ray_worker_kubernetes_service_account = "ray-worker"
+ mlflow_kubernetes_service_account = "mlflow"
}
@@ -24,16 +25,16 @@ locals {
###############################################################################
resource "null_resource" "template_manifests" {
depends_on = [
- google_gke_hub_feature_membership.cluster_configmanagement
+ google_gke_hub_feature_membership.cluster_configmanagement,
+ google_secret_manager_secret_version.git_config,
]
provisioner "local-exec" {
command = "${path.module}/scripts/template_manifests.sh"
environment = {
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ PROJECT_ID = data.google_project.environment.project_id
}
}
@@ -50,18 +51,18 @@ resource "null_resource" "template_manifests" {
resource "null_resource" "cluster_manifests" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.template_manifests
+ google_secret_manager_secret_version.git_config,
+ null_resource.template_manifests,
]
provisioner "local-exec" {
command = "${path.module}/scripts/cluster_manifests.sh"
environment = {
- CLUSTER_ENV = var.environment_name
- CLUSTER_NAME = google_container_cluster.mlp.name
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
+ CLUSTER_ENV = var.environment_name
+ CLUSTER_NAME = google_container_cluster.mlp.name
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ PROJECT_ID = data.google_project.environment.project_id
}
}
@@ -78,18 +79,18 @@ resource "null_resource" "cluster_manifests" {
resource "null_resource" "git_cred_secret_cms" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.connect_gateway_kubeconfig
+ google_secret_manager_secret_version.git_config,
+ null_resource.connect_gateway_kubeconfig,
]
provisioner "local-exec" {
command = "${path.module}/scripts/git_cred_secret.sh"
environment = {
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
- K8S_NAMESPACE = "config-management-system"
- KUBECONFIG = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ K8S_NAMESPACE = "config-management-system"
+ KUBECONFIG = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
+ PROJECT_ID = data.google_project.environment.project_id
}
}
@@ -106,16 +107,16 @@ resource "null_resource" "git_cred_secret_cms" {
resource "null_resource" "kueue" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.cluster_manifests
+ google_secret_manager_secret_version.git_config,
+ null_resource.cluster_manifests,
]
provisioner "local-exec" {
command = "${path.module}/scripts/kueue_manifests.sh"
environment = {
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ PROJECT_ID = data.google_project.environment.project_id
}
}
@@ -129,27 +130,27 @@ resource "null_resource" "kueue" {
# NVIDIA DCGM
###############################################################################
-resource "null_resource" "nvidia_dcgm" {
- depends_on = [
- google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.kueue
- ]
-
- provisioner "local-exec" {
- command = "${path.module}/scripts/nvidia_dcgm_manifests.sh"
- environment = {
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
- }
- }
-
- triggers = {
- md5_files = md5(join("", [for f in fileset("${path.module}/templates/configsync/templates/_cluster_template/gmp-public/nvidia-dcgm", "**") : md5("${path.module}/templates/configsync/templates/_cluster_template/gmp-public/nvidia-dcgm/${f}")]))
- md5_script = filemd5("${path.module}/scripts/nvidia_dcgm_manifests.sh")
- }
-}
+# resource "null_resource" "nvidia_dcgm" {
+# depends_on = [
+# google_gke_hub_feature_membership.cluster_configmanagement,
+# google_secret_manager_secret_version.git_config,
+# null_resource.kueue,
+# ]
+
+# provisioner "local-exec" {
+# command = "${path.module}/scripts/nvidia_dcgm_manifests.sh"
+# environment = {
+# GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+# GIT_REPOSITORY = local.git_repository
+# PROJECT_ID = data.google_project.environment.project_id
+# }
+# }
+
+# triggers = {
+# md5_files = md5(join("", [for f in fileset("${path.module}/templates/configsync/templates/_cluster_template/gmp-public/nvidia-dcgm", "**") : md5("${path.module}/templates/configsync/templates/_cluster_template/gmp-public/nvidia-dcgm/${f}")]))
+# md5_script = filemd5("${path.module}/scripts/nvidia_dcgm_manifests.sh")
+# }
+# }
@@ -158,17 +159,18 @@ resource "null_resource" "nvidia_dcgm" {
resource "null_resource" "kuberay_manifests" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.nvidia_dcgm,
+ google_secret_manager_secret_version.git_config,
+ null_resource.kueue,
+ #null_resource.nvidia_dcgm,
]
provisioner "local-exec" {
command = "${path.module}/scripts/kuberay_manifests.sh"
environment = {
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
- K8S_NAMESPACE = var.namespace
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ K8S_NAMESPACE = var.namespace
+ PROJECT_ID = data.google_project.environment.project_id
}
}
@@ -178,63 +180,57 @@ resource "null_resource" "kuberay_manifests" {
}
}
-
-
# NAMESPACE
###############################################################################
resource "null_resource" "namespace_manifests" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
+ google_secret_manager_secret_version.git_config,
null_resource.connect_gateway_kubeconfig,
- null_resource.kuberay_manifests
+ null_resource.kuberay_manifests,
]
provisioner "local-exec" {
command = "${path.module}/scripts/namespace_manifests.sh"
environment = {
- CLUSTER_ENV = var.environment_name
- CLUSTER_NAME = google_container_cluster.mlp.name
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
- K8S_NAMESPACE = self.triggers.namespace
+ CLUSTER_ENV = var.environment_name
+ CLUSTER_NAME = google_container_cluster.mlp.name
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ K8S_NAMESPACE = self.triggers.namespace
+ PROJECT_ID = data.google_project.environment.project_id
}
}
provisioner "local-exec" {
command = "scripts/namespace_cleanup.sh"
environment = {
- ENVIRONMENT_NAME = self.triggers.environment_name
- GIT_EMAIL = self.triggers.github_email
- GIT_REPOSITORY = self.triggers.git_repository
- GIT_TOKEN = self.triggers.github_token
- GIT_USERNAME = self.triggers.github_user
- KUBECONFIG = self.triggers.kubeconfig
- K8S_NAMESPACE = self.triggers.namespace
- PROJECT_ID = self.triggers.project_id
- REPO_SYNC_NAME = self.triggers.repo_sync_name
- REPO_SYNC_NAMESPACE = self.triggers.repo_sync_namespace
- ROOT_SYNC_NAME = self.triggers.root_sync_name
+ ENVIRONMENT_NAME = self.triggers.environment_name
+ GIT_CONFIG_SECRET_NAME = self.triggers.git_config_secret_name
+ GIT_REPOSITORY = self.triggers.git_repository
+ KUBECONFIG = self.triggers.kubeconfig
+ K8S_NAMESPACE = self.triggers.namespace
+ PROJECT_ID = self.triggers.project_id
+ REPO_SYNC_NAME = self.triggers.repo_sync_name
+ REPO_SYNC_NAMESPACE = self.triggers.repo_sync_namespace
+ ROOT_SYNC_NAME = self.triggers.root_sync_name
}
when = destroy
working_dir = path.module
}
triggers = {
- environment_name = var.environment_name
- git_repository = local.git_repository
- github_email = var.git_user_email
- github_token = var.git_token
- github_user = var.git_user_name
- kubeconfig = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
- project_id = data.google_project.environment.project_id
- md5_files = md5(join("", [for f in fileset("${path.module}/templates/configsync/templates/_cluster_template/team", "**") : md5("${path.module}/templates/configsync/templates/_cluster_template/team/${f}")]))
- md5_script = filemd5("${path.module}/scripts/namespace_manifests.sh")
- namespace = var.namespace
- repo_sync_name = "${var.environment_name}-${var.namespace}"
- repo_sync_namespace = var.namespace
- root_sync_name = "root-sync"
+ environment_name = var.environment_name
+ git_config_secret_name = local.git_config_secret_name
+ git_repository = local.git_repository
+ kubeconfig = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
+ project_id = data.google_project.environment.project_id
+ md5_files = md5(join("", [for f in fileset("${path.module}/templates/configsync/templates/_cluster_template/team", "**") : md5("${path.module}/templates/configsync/templates/_cluster_template/team/${f}")]))
+ md5_script = filemd5("${path.module}/scripts/namespace_manifests.sh")
+ namespace = var.namespace
+ repo_sync_name = "${var.environment_name}-${var.namespace}"
+ repo_sync_namespace = var.namespace
+ root_sync_name = "root-sync"
}
}
@@ -245,16 +241,17 @@ resource "null_resource" "namespace_manifests" {
resource "null_resource" "git_cred_secret_ns" {
depends_on = [
null_resource.connect_gateway_kubeconfig,
- null_resource.namespace_manifests
+ google_secret_manager_secret_version.git_config,
+ null_resource.namespace_manifests,
]
provisioner "local-exec" {
command = "${path.module}/scripts/git_cred_secret.sh"
environment = {
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
- K8S_NAMESPACE = var.namespace
- KUBECONFIG = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ K8S_NAMESPACE = var.namespace
+ KUBECONFIG = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
+ PROJECT_ID = data.google_project.environment.project_id
}
}
@@ -271,17 +268,17 @@ resource "null_resource" "git_cred_secret_ns" {
resource "null_resource" "kuberay_watch_namespace_manifests" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.namespace_manifests
+ google_secret_manager_secret_version.git_config,
+ null_resource.namespace_manifests,
]
provisioner "local-exec" {
command = "${path.module}/scripts/kuberay_watch_namespace_manifests.sh"
environment = {
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
- K8S_NAMESPACE = var.namespace
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ K8S_NAMESPACE = var.namespace
+ PROJECT_ID = data.google_project.environment.project_id
}
}
@@ -294,28 +291,30 @@ resource "null_resource" "kuberay_watch_namespace_manifests" {
# RAY CLUSTER IN NAMESPACE
###############################################################################
-resource "null_resource" "ray_cluster_namespace_manifests" {
+resource "null_resource" "cluster_namespace_manifests" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.kuberay_watch_namespace_manifests
+ google_secret_manager_secret_version.git_config,
+ null_resource.kuberay_watch_namespace_manifests,
]
provisioner "local-exec" {
- command = "${path.module}/scripts/ray_cluster_namespace_manifests.sh"
+ command = "${path.module}/scripts/cluster_namespace_manifests.sh"
environment = {
- GIT_EMAIL = var.git_user_email
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
K8S_NAMESPACE = var.namespace
K8S_SERVICE_ACCOUNT_HEAD = local.ray_head_kubernetes_service_account
K8S_SERVICE_ACCOUNT_WORKER = local.ray_worker_kubernetes_service_account
+ K8S_SERVICE_ACCOUNT_MLFLOW = local.mlflow_kubernetes_service_account
+ DATA_BUCKET = local.bucket_data_name
+ PROJECT_ID = data.google_project.environment.project_id
}
}
triggers = {
md5_files = md5(join("", [for f in fileset("${path.module}/templates/configsync/templates/_namespace_template/app", "**") : md5("${path.module}/templates/configsync/templates/_namespace_template/app/${f}")]))
- md5_script = filemd5("${path.module}/scripts/ray_cluster_namespace_manifests.sh")
+ md5_script = filemd5("${path.module}/scripts/cluster_namespace_manifests.sh")
}
}
@@ -347,6 +346,19 @@ resource "local_file" "policy_iap_ray_head_yaml" {
filename = "${local.gateway_manifests_directory}/policy-iap-ray-head.yaml"
}
+resource "local_file" "policy_iap_mlflow_tracking_yaml" {
+ content = templatefile(
+ "${path.module}/templates/gateway/gcp-backend-policy-iap-service.tftpl.yaml",
+ {
+ oauth_client_id = google_iap_client.ray_head_client.client_id
+ oauth_client_secret_name = kubernetes_secret_v1.ray_head_client.metadata[0].name
+ policy_name = "mlflow"
+ service_name = local.mlflow_tracking_service_name
+ }
+ )
+ filename = "${local.gateway_manifests_directory}/policy-iap-mlflow.yaml"
+}
+
resource "local_file" "gateway_kustomization_yaml" {
content = templatefile(
"${path.module}/templates/kustomize/kustomization.tftpl.yaml",
@@ -356,6 +368,8 @@ resource "local_file" "gateway_kustomization_yaml" {
basename(local_file.gateway_external_https_yaml.filename),
basename(local_file.policy_iap_ray_head_yaml.filename),
basename(local_file.route_ray_dashboard_https_yaml.filename),
+ basename(local_file.policy_iap_mlflow_tracking_yaml.filename),
+ basename(local_file.route_mlflow_tracking_https_yaml.filename),
]
}
)
@@ -367,22 +381,22 @@ resource "null_resource" "gateway_manifests" {
google_compute_managed_ssl_certificate.external_gateway,
google_endpoints_service.ray_dashboard_https,
google_gke_hub_feature_membership.cluster_configmanagement,
+ google_secret_manager_secret_version.git_config,
kubernetes_secret_v1.ray_head_client,
- null_resource.ray_cluster_namespace_manifests
+ null_resource.cluster_namespace_manifests,
]
provisioner "local-exec" {
command = "scripts/gateway_manifests.sh"
environment = {
- ENVIRONMENT_NAME = self.triggers.environment_name
- GIT_EMAIL = self.triggers.github_email
- GIT_REPOSITORY = self.triggers.git_repository
- GIT_TOKEN = self.triggers.github_token
- GIT_USERNAME = self.triggers.github_user
- KUBECONFIG = self.triggers.kubeconfig
- K8S_NAMESPACE = self.triggers.namespace
- REPO_SYNC_NAME = self.triggers.repo_sync_name
- REPO_SYNC_NAMESPACE = self.triggers.repo_sync_namespace
+ ENVIRONMENT_NAME = self.triggers.environment_name
+ GIT_CONFIG_SECRET_NAME = self.triggers.git_config_secret_name
+ GIT_REPOSITORY = self.triggers.git_repository
+ KUBECONFIG = self.triggers.kubeconfig
+ K8S_NAMESPACE = self.triggers.namespace
+ PROJECT_ID = self.triggers.project_id
+ REPO_SYNC_NAME = self.triggers.repo_sync_name
+ REPO_SYNC_NAMESPACE = self.triggers.repo_sync_namespace
}
interpreter = ["bash", "-c"]
working_dir = path.module
@@ -391,14 +405,13 @@ resource "null_resource" "gateway_manifests" {
provisioner "local-exec" {
command = "scripts/gateway_cleanup.sh"
environment = {
- GIT_EMAIL = self.triggers.github_email
- GIT_REPOSITORY = self.triggers.git_repository
- GIT_TOKEN = self.triggers.github_token
- GIT_USERNAME = self.triggers.github_user
- K8S_NAMESPACE = self.triggers.namespace
- KUBECONFIG = self.triggers.kubeconfig
- REPO_SYNC_NAME = self.triggers.repo_sync_name
- REPO_SYNC_NAMESPACE = self.triggers.repo_sync_namespace
+ GIT_CONFIG_SECRET_NAME = self.triggers.git_config_secret_name
+ GIT_REPOSITORY = self.triggers.git_repository
+ K8S_NAMESPACE = self.triggers.namespace
+ KUBECONFIG = self.triggers.kubeconfig
+ PROJECT_ID = self.triggers.project_id
+ REPO_SYNC_NAME = self.triggers.repo_sync_name
+ REPO_SYNC_NAMESPACE = self.triggers.repo_sync_namespace
}
interpreter = ["bash", "-c"]
when = destroy
@@ -406,21 +419,25 @@ resource "null_resource" "gateway_manifests" {
}
triggers = {
- environment_name = var.environment_name
- gateway_name = local.gateway_name
- git_repository = local.git_repository
- github_email = var.git_user_email
- github_token = var.git_token
- github_user = var.git_user_name
- kubeconfig = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
- md5_script = filemd5("${path.module}/scripts/gateway_manifests.sh")
+ environment_name = var.environment_name
+ gateway_name = local.gateway_name
+ git_config_secret_name = local.git_config_secret_name
+ git_repository = local.git_repository
+ github_email = var.git_user_email
+ github_token = var.git_token
+ github_user = var.git_user_name
+ kubeconfig = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
+ md5_script = filemd5("${path.module}/scripts/gateway_manifests.sh")
md5_files = md5(join("", [
local_file.gateway_external_https_yaml.content_md5,
local_file.policy_iap_ray_head_yaml.content_md5,
+ local_file.policy_iap_mlflow_tracking_yaml.content_md5,
local_file.route_ray_dashboard_https_yaml.content_md5,
+ local_file.route_mlflow_tracking_https_yaml.content_md5,
local_file.gateway_kustomization_yaml.content_md5
]))
namespace = data.kubernetes_namespace_v1.team.metadata[0].name
+ project_id = data.google_project.environment.project_id
repo_sync_name = "${var.environment_name}-${data.kubernetes_namespace_v1.team.metadata[0].name}"
repo_sync_namespace = data.kubernetes_namespace_v1.team.metadata[0].name
@@ -434,20 +451,20 @@ resource "null_resource" "gateway_manifests" {
resource "null_resource" "wait_for_configsync" {
depends_on = [
google_gke_hub_feature_membership.cluster_configmanagement,
- null_resource.gateway_manifests
+ google_secret_manager_secret_version.git_config,
+ null_resource.gateway_manifests,
]
provisioner "local-exec" {
command = "${path.module}/scripts/wait_for_configsync.sh"
environment = {
- GIT_EMAIL = var.git_user_email
- GIT_REPOSITORY = local.git_repository
- GIT_TOKEN = var.git_token
- GIT_USERNAME = var.git_user_name
- KUBECONFIG = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
- REPO_SYNC_NAME = "${var.environment_name}-${data.kubernetes_namespace_v1.team.metadata[0].name}"
- REPO_SYNC_NAMESPACE = data.kubernetes_namespace_v1.team.metadata[0].name
- ROOT_SYNC_NAME = "root-sync"
+ GIT_CONFIG_SECRET_NAME = local.git_config_secret_name
+ GIT_REPOSITORY = local.git_repository
+ KUBECONFIG = "${local.kubeconfig_dir}/${data.google_project.environment.project_id}_${google_gke_hub_membership.cluster.membership_id}"
+ PROJECT_ID = data.google_project.environment.project_id
+ REPO_SYNC_NAME = "${var.environment_name}-${data.kubernetes_namespace_v1.team.metadata[0].name}"
+ REPO_SYNC_NAMESPACE = data.kubernetes_namespace_v1.team.metadata[0].name
+ ROOT_SYNC_NAME = "root-sync"
}
}
diff --git a/best-practices/ml-platform/examples/platform/playground/mvp_resources.tf b/best-practices/ml-platform/examples/platform/playground/mvp_resources.tf
new file mode 100644
index 000000000..64d0b14f0
--- /dev/null
+++ b/best-practices/ml-platform/examples/platform/playground/mvp_resources.tf
@@ -0,0 +1,267 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+locals {
+ bucket_data_name = "${data.google_project.environment.project_id}-${var.environment_name}-data"
+ bucket_model_name = "${data.google_project.environment.project_id}-${var.environment_name}-model"
+ data_preparation_ksa = "${var.environment_name}-${var.namespace}-data-preparation"
+ data_processing_ksa = "${var.environment_name}-${var.namespace}-data-processing"
+ fine_tuning_ksa = "${var.environment_name}-${var.namespace}-fine-tuning"
+ gsa_build_account_id = "${var.environment_name}-${var.namespace}-build"
+ gsa_build_email = google_service_account.build.email
+ gsa_build_roles = [
+ "roles/logging.logWriter",
+ ]
+ model_evaluation_ksa = "${var.environment_name}-${var.namespace}-model-evaluation"
+ repo_container_images_id = var.environment_name
+ repo_container_images_url = "${google_artifact_registry_repository.container_images.location}-docker.pkg.dev/${google_artifact_registry_repository.container_images.project}/${local.repo_container_images_id}"
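+ # Workload Identity principal prefix for Kubernetes ServiceAccounts in the team namespace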
+ wi_member_principal_prefix = "principal://iam.googleapis.com/projects/${data.google_project.environment.number}/locations/global/workloadIdentityPools/${data.google_project.environment.project_id}.svc.id.goog/subject/ns/${var.namespace}/sa"
+}
+
+# SERVICES
+###############################################################################
+resource "google_project_service" "aiplatform_googleapis_com" {
+ disable_dependent_services = false
+ disable_on_destroy = false
+ project = data.google_project.environment.project_id
+ service = "aiplatform.googleapis.com"
+}
+
+resource "google_project_service" "artifactregistry_googleapis_com" {
+ disable_dependent_services = false
+ disable_on_destroy = false
+ project = data.google_project.environment.project_id
+ service = "artifactregistry.googleapis.com"
+}
+
+resource "google_project_service" "cloudbuild_googleapis_com" {
+ disable_dependent_services = false
+ disable_on_destroy = false
+ project = data.google_project.environment.project_id
+ service = "cloudbuild.googleapis.com"
+}
+
+# ARTIFACT REGISTRY
+###############################################################################
+resource "google_artifact_registry_repository" "container_images" {
+ format = "DOCKER"
+ location = var.subnet_01_region
+ project = google_project_service.artifactregistry_googleapis_com.project
+ repository_id = local.repo_container_images_id
+}
+
+# GCS
+###############################################################################
+resource "google_storage_bucket" "data" {
+ depends_on = [
+ google_container_cluster.mlp
+ ]
+
+ force_destroy = true
+ location = var.subnet_01_region
+ name = local.bucket_data_name
+ project = data.google_project.environment.project_id
+ uniform_bucket_level_access = true
+}
+
+resource "google_storage_bucket" "model" {
+ depends_on = [
+ google_container_cluster.mlp
+ ]
+
+ force_destroy = true
+ location = var.subnet_01_region
+ name = local.bucket_model_name
+ project = data.google_project.environment.project_id
+ uniform_bucket_level_access = true
+}
+
+# GSA
+###############################################################################
+resource "google_service_account" "build" {
+ project = data.google_project.environment.project_id
+ account_id = local.gsa_build_account_id
+ display_name = "${local.gsa_build_account_id} Service Account"
+ description = "Terraform-managed service account for ${local.gsa_build_account_id}"
+}
+
+resource "google_project_iam_member" "gsa_build" {
+ for_each = toset(local.gsa_build_roles)
+
+ project = data.google_project.environment.project_id
+ member = google_service_account.build.member
+ role = each.value
+}
+
+resource "google_artifact_registry_repository_iam_member" "container_images_gsa_build_artifactregistry_writer" {
+ location = google_artifact_registry_repository.container_images.location
+ member = google_service_account.build.member
+ project = google_artifact_registry_repository.container_images.project
+ repository = google_artifact_registry_repository.container_images.name
+ role = "roles/artifactregistry.writer"
+}
+
+resource "google_storage_bucket_iam_member" "cloudbuild_bucket_gsa_build_storage_object_viewer" {
+ bucket = "${data.google_project.environment.project_id}_cloudbuild"
+ member = google_service_account.build.member
+ role = "roles/storage.objectViewer"
+}
+
+# KSA
+###############################################################################
+resource "kubernetes_service_account_v1" "data_processing" {
+ depends_on = [
+ null_resource.git_cred_secret_ns,
+ null_resource.namespace_manifests
+ ]
+
+ metadata {
+ name = local.data_processing_ksa
+ namespace = var.namespace
+ }
+}
+
+resource "kubernetes_service_account_v1" "data_preparation" {
+ depends_on = [
+ null_resource.git_cred_secret_ns,
+ null_resource.namespace_manifests
+ ]
+
+ metadata {
+ name = local.data_preparation_ksa
+ namespace = var.namespace
+ }
+}
+
+resource "kubernetes_service_account_v1" "fine_tuning" {
+ depends_on = [
+ null_resource.git_cred_secret_ns,
+ null_resource.namespace_manifests
+ ]
+
+ metadata {
+ name = local.fine_tuning_ksa
+ namespace = var.namespace
+ }
+}
+
+resource "kubernetes_service_account_v1" "model_evaluation" {
+ depends_on = [
+ null_resource.git_cred_secret_ns,
+ null_resource.namespace_manifests
+ ]
+
+ metadata {
+ name = local.model_evaluation_ksa
+ namespace = var.namespace
+ }
+}
+
+# IAM
+###############################################################################
+resource "google_storage_bucket_iam_member" "data_bucket_ray_head_storage_object_viewer" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.ray_head_kubernetes_service_account}"
+ role = "roles/storage.objectViewer"
+}
+
+resource "google_storage_bucket_iam_member" "data_bucket_ray_worker_storage_object_admin" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.ray_worker_kubernetes_service_account}"
+ role = "roles/storage.objectAdmin"
+}
+
+resource "google_storage_bucket_iam_member" "data_bucket_mlflow_storage_object_admin" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.mlflow_kubernetes_service_account}"
+ role = "roles/storage.objectAdmin"
+}
+
+resource "google_storage_bucket_iam_member" "data_bucket_data_processing_ksa_storage_object_user" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.data_processing_ksa}"
+ role = "roles/storage.objectUser"
+}
+
+resource "google_storage_bucket_iam_member" "data_bucket_data_preparation_storage_object_user" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.data_preparation_ksa}"
+ role = "roles/storage.objectUser"
+}
+
+resource "google_project_iam_member" "data_preparation_aiplatform_user" {
+ depends_on = [
+ google_container_cluster.mlp
+ ]
+
+ project = data.google_project.environment.project_id
+ member = "${local.wi_member_principal_prefix}/${local.data_preparation_ksa}"
+ role = "roles/aiplatform.user"
+}
+
+resource "google_storage_bucket_iam_member" "data_bucket_fine_tuning_storage_object_user" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.fine_tuning_ksa}"
+ role = "roles/storage.objectUser"
+}
+
+resource "google_storage_bucket_iam_member" "model_bucket_fine_tuning_storage_object_user" {
+ bucket = google_storage_bucket.model.name
+ member = "${local.wi_member_principal_prefix}/${local.fine_tuning_ksa}"
+ role = "roles/storage.objectUser"
+}
+
+resource "google_storage_bucket_iam_member" "data_bucket_model_evaluation_storage_storage_insights_collector_service" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.model_evaluation_ksa}"
+ role = "roles/storage.insightsCollectorService"
+}
+
+resource "google_storage_bucket_iam_member" "data_bucket_model_evaluation_storage_object_user" {
+ bucket = google_storage_bucket.data.name
+ member = "${local.wi_member_principal_prefix}/${local.model_evaluation_ksa}"
+ role = "roles/storage.objectUser"
+}
+
+resource "google_storage_bucket_iam_member" "model_bucket_model_evaluation_storage_object_user" {
+ bucket = google_storage_bucket.model.name
+ member = "${local.wi_member_principal_prefix}/${local.model_evaluation_ksa}"
+ role = "roles/storage.objectUser"
+}
+
+output "environment_configuration" {
+ value = </dev/null 2>&1; do
sleep 2
diff --git a/best-practices/ml-platform/examples/platform/playground/scripts/helpers/clone_git_repo.sh b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/clone_git_repo.sh
index 3a8ea77c2..c2b7e36c3 100755
--- a/best-practices/ml-platform/examples/platform/playground/scripts/helpers/clone_git_repo.sh
+++ b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/clone_git_repo.sh
@@ -21,6 +21,8 @@ function cleanup() {
}
trap cleanup EXIT
+source ${SCRIPT_PATH}/helpers/git_config_env.sh
+
random_suffix=$(echo $RANDOM | md5sum | head -c 20)
export GIT_REPOSITORY_PATH="/tmp/$(basename ${GIT_REPOSITORY})-${random_suffix}"
diff --git a/best-practices/ml-platform/examples/platform/playground/scripts/helpers/git_config_env.sh b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/git_config_env.sh
new file mode 100755
index 000000000..8f0bc472b
--- /dev/null
+++ b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/git_config_env.sh
@@ -0,0 +1,20 @@
+#!/bin/bash
+#
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+set -u
+
+GIT_EMAIL=$(gcloud secrets versions access latest --project ${PROJECT_ID} --secret ${GIT_CONFIG_SECRET_NAME} | jq --raw-output .user_email)
+GIT_TOKEN=$(gcloud secrets versions access latest --project ${PROJECT_ID} --secret ${GIT_CONFIG_SECRET_NAME} | jq --raw-output .token)
+GIT_USERNAME=$(gcloud secrets versions access latest --project ${PROJECT_ID} --secret ${GIT_CONFIG_SECRET_NAME} | jq --raw-output .user_name)
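+
+# Note: the three lookups above each make a separate Secret Manager call; they could also be
+# collapsed into a single fetch that is parsed locally, for example:
+#   git_config=$(gcloud secrets versions access latest --project ${PROJECT_ID} --secret ${GIT_CONFIG_SECRET_NAME})
+#   GIT_EMAIL=$(echo "${git_config}" | jq --raw-output .user_email)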
diff --git a/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_repo_sync.sh b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_repo_sync.sh
index 472202ce7..5fd14e486 100755
--- a/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_repo_sync.sh
+++ b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_repo_sync.sh
@@ -24,4 +24,4 @@ while [[ ${last_synced_commit} != ${LAST_COMMIT} ]]; do
last_synced_commit=$(kubectl get reposync ${REPO_SYNC_NAME} -n ${REPO_SYNC_NAMESPACE} --output go-template='{{.status.lastSyncedCommit}}')
done
-echo "reposync '${REPO_SYNC_NAME}' has synchronize"
+echo "reposync '${REPO_SYNC_NAME}' has synchronized"
diff --git a/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_root_sync.sh b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_root_sync.sh
index 80f5ce4e9..5c1839a7c 100755
--- a/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_root_sync.sh
+++ b/best-practices/ml-platform/examples/platform/playground/scripts/helpers/wait_for_root_sync.sh
@@ -24,4 +24,4 @@ while [[ ${last_synced_commit} != ${LAST_COMMIT} ]]; do
last_synced_commit=$(kubectl get rootsync ${ROOT_SYNC_NAME} -n config-management-system --output go-template='{{.status.lastSyncedCommit}}')
done
-echo "rootsync '${ROOT_SYNC_NAME}' has synchronize"
+echo "rootsync '${ROOT_SYNC_NAME}' has synchronized"
diff --git a/best-practices/ml-platform/examples/platform/playground/secret_manager.tf b/best-practices/ml-platform/examples/platform/playground/secret_manager.tf
new file mode 100644
index 000000000..b1ee79830
--- /dev/null
+++ b/best-practices/ml-platform/examples/platform/playground/secret_manager.tf
@@ -0,0 +1,37 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+locals {
+ git_config_secret_name = "${var.environment_name}-git-config"
+}
+
+resource "google_secret_manager_secret" "git_config" {
+ project = google_project_service.secretmanager_googleapis_com.project
+ secret_id = local.git_config_secret_name
+
+ replication {
+ auto {}
+ }
+}
+
+resource "google_secret_manager_secret_version" "git_config" {
+ secret = google_secret_manager_secret.git_config.id
+
+ secret_data = jsonencode({
+ "namespace" = var.git_namespace,
+ "token" = var.git_token,
+ "user_email" = var.git_user_email,
+ "user_name" = var.git_user_name,
+ })
+}
diff --git a/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_cluster_template/team/serviceaccount.yaml b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_cluster_template/team/serviceaccount.yaml
new file mode 100644
index 000000000..cf140c555
--- /dev/null
+++ b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_cluster_template/team/serviceaccount.yaml
@@ -0,0 +1,19 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: app-sa
+ namespace: NAMESPACE
diff --git a/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/fluentd_config.yaml b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/fluentbit_config.yaml
similarity index 78%
rename from best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/fluentd_config.yaml
rename to best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/fluentbit_config.yaml
index 23b3e458f..0ec2fc19f 100644
--- a/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/fluentd_config.yaml
+++ b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/fluentbit_config.yaml
@@ -20,29 +20,32 @@ kind: ConfigMap
metadata:
name: fluentbit-config
data:
- # Based on https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/modules/kuberay-logging/config/fluent-bit.conf
- fluent-bit.conf: |
+ fluent-bit.conf: |-
[SERVICE]
+ # Flush logs every 1 second, even if the buffer is not full, to minimize log entry arrival delay
+ Flush 1
+ Grace 5
Parsers_File parsers.conf
[INPUT]
Name tail
Path /tmp/ray/session_latest/logs/*
Tag ray
Path_Key filename
+ Buffer_Max_Size 1MB
+ Mem_Buf_Limit 5MB
Refresh_Interval 5
- [FILTERS]
+ [FILTER]
Name parser
Match ray
Key_Name filename
Parser rayjob
Reserve_Data On
+ Preserve_Key True
[OUTPUT]
Name stdout
Format json_lines
Match *
- # Based on https://github.com/GoogleCloudPlatform/ai-on-gke/blob/main/modules/kuberay-logging/config/parsers.conf
- parsers.conf: |
+ parsers.conf: |-
[PARSER]
Name rayjob
Format regex
- Regex (?raysubmit_[^.]*)
\ No newline at end of file
+ Regex job-driver-(?[^.]*)\.log
diff --git a/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/kustomization.yaml b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/kustomization.yaml
index 533ca8fdb..92d83d6bb 100644
--- a/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/kustomization.yaml
+++ b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/kustomization.yaml
@@ -24,7 +24,7 @@ helmCharts:
valuesFile: values.yaml
resources:
-- fluentd_config.yaml
+- fluentbit_config.yaml
- podmonitoring_ray.yaml
- serviceaccount_ray_head.yaml
- serviceaccount_ray_worker.yaml
diff --git a/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/mlflow_tracking.yaml b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/mlflow_tracking.yaml
new file mode 100644
index 000000000..14057db48
--- /dev/null
+++ b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/mlflow_tracking.yaml
@@ -0,0 +1,73 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: mlflow-tracking
+spec:
+ replicas: 1
+ strategy:
+ type: RollingUpdate
+ selector:
+ matchLabels:
+ app: mlflow-tracking
+ template:
+ metadata:
+ labels:
+ app: mlflow-tracking
+ annotations:
+ gke-gcsfuse/volumes: "true"
+ spec:
+ serviceAccountName: KUBERNETES_SERVICE_ACCOUNT_MLFLOW
+ containers:
+ - name: mlflow
+ image: ghcr.io/mlflow/mlflow:v2.15.1
+ resources:
+ requests:
+ cpu: "2"
+ memory: "10Gi"
+ limits:
+ cpu: "2"
+ memory: "10Gi"
+ command: ["sh", "-c"]
+ # mlflow prometheus metrics
+ # pip install --no-cache-dir prometheus_flask_exporter
+ # mlflow server --expose-prometheus /metrics
+ args:
+ - |
+ mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:///mlruns/mlflow.db
+ volumeMounts:
+ - name: gcs-fuse-csi-ephemeral
+ mountPath: /mlruns
+ volumes:
+ - name: gcs-fuse-csi-ephemeral
+ csi:
+ driver: gcsfuse.csi.storage.gke.io
+ volumeAttributes:
+ bucketName: DB_BUCKET
+ mountOptions: "implicit-dirs"
+ gcsfuseLoggingSeverity: warning
+---
+apiVersion: v1
+kind: Service
+metadata:
+ name: mlflow-tracking-svc
+spec:
+ selector:
+ app: mlflow-tracking
+ ports:
+ - protocol: TCP
+ port: 5000
+ targetPort: 5000
diff --git a/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/serviceaccount_mlflow.yaml b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/serviceaccount_mlflow.yaml
new file mode 100644
index 000000000..e0c2a389c
--- /dev/null
+++ b/best-practices/ml-platform/examples/platform/playground/templates/configsync/templates/_namespace_template/app/serviceaccount_mlflow.yaml
@@ -0,0 +1,19 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+ name: KUBERNETES_SERVICE_ACCOUNT_MLFLOW
+ namespace: NAMESPACE
diff --git a/best-practices/ml-platform/examples/platform/playground/variables.tf b/best-practices/ml-platform/examples/platform/playground/variables.tf
index c206a7f4e..7c6fd0472 100644
--- a/best-practices/ml-platform/examples/platform/playground/variables.tf
+++ b/best-practices/ml-platform/examples/platform/playground/variables.tf
@@ -12,9 +12,9 @@
# See the License for the specific language governing permissions and
# limitations under the License.
-variable "cluster_name" {
+variable "cluster_name_prefix" {
default = "mlp"
- description = "Name of the GKE cluster"
+ description = "GKE cluster name prefix"
type = string
}
diff --git a/best-practices/ml-platform/examples/platform/playground/versions.tf b/best-practices/ml-platform/examples/platform/playground/versions.tf
index 350587648..f539bde14 100644
--- a/best-practices/ml-platform/examples/platform/playground/versions.tf
+++ b/best-practices/ml-platform/examples/platform/playground/versions.tf
@@ -18,11 +18,11 @@ terraform {
required_providers {
google = {
source = "hashicorp/google"
- version = "5.35.0"
+ version = "5.38.0"
}
google-beta = {
source = "hashicorp/google-beta"
- version = "5.35.0"
+ version = "5.38.0"
}
kubernetes = {
source = "hashicorp/kubernetes"
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/DEVELOPER.md b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/DEVELOPER.md
new file mode 100644
index 000000000..81abbdfac
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/DEVELOPER.md
@@ -0,0 +1,75 @@
+# Data Preparation Developer Guide
+
+- Install [`pyenv`](https://github.com/pyenv/pyenv?tab=readme-ov-file#installation)
+
+- Install the `python` version
+
+ ```
+ pyenv install 3.12.4
+ ```
+
+- Clone the repository
+
+ ```
+ git clone https://github.com/GoogleCloudPlatform/ai-on-gke.git && \
+ cd ai-on-gke
+ ```
+
+- Change directory to the `src` directory
+
+ ```
+ cd best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src
+ ```
+
+- Set the local `python` version
+
+ ```
+ pyenv local 3.12.4
+ ```
+
+- Create a virtual environment
+
+ ```
+ python -m venv venv
+ ```
+
+- Activate the virtual environment
+
+ ```
+ source venv/bin/activate
+ ```
+
+- Install the requirements
+
+ ```
+ pip install --no-cache-dir -r requirements.txt
+ ```
+
+- Set environment variables
+
+ ```
+ export PROJECT_ID=
+ export MLP_ENVIRONMENT_NAME=
+
+ export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
+
+ export BUCKET="${PROJECT_ID}-${MLP_ENVIRONMENT_NAME}-flipkart-data"
+ export NAMESPACE=ml-team
+ export KSA=default
+ export CLUSTER_NAME="mlp-${MLP_ENVIRONMENT_NAME}"
+ export CLUSTER_REGION=us-central1
+ export DOCKER_IMAGE_URL="us-docker.pkg.dev/${PROJECT_ID}/${MLP_ENVIRONMENT_NAME}-llm-finetuning/dataprep:v1.0.0"
+ export REGION=us-central1
+
+ export DATASET_INPUT_PATH="flipkart_preprocessed_dataset"
+ export DATASET_INPUT_FILE="flipkart.csv"
+ export DATASET_OUTPUT_PATH="dataset/output"
+ export PROMPT_MODEL_ID="gemini-1.5-flash-001"
+
+ ```
+
+- Run the `dataprep.py` script
+
+ ```
+ python dataprep.py
+ ```
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/README.md b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/README.md
new file mode 100644
index 000000000..82508046b
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/README.md
@@ -0,0 +1,151 @@
+# Data Preparation
+
+A processed Flipkart product catalog is used as input data to generate prompts in preparation for fine-tuning. The prompts are generated using Vertex AI's Gemini Flash model. The output is a data set that can be used to fine-tune the base model.
+
+Depending on the infrastructure you provisioned, the data preparation step takes approximately 1 hour and 40 minutes.
+
+## Prerequisites
+
+- This guide was developed to be run on the [playground machine learning platform](/best-practices/ml-platform/examples/platform/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
+- A bucket containing the processed data from the [Data Processing example](../../data-processing/ray)
+
+## Preparation
+
+- Clone the repository and change directory to the guide directory
+
+ ```sh
+ git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
+ cd ai-on-gke/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it
+ ```
+
+- Ensure that your `MLP_ENVIRONMENT_FILE` is configured
+
+ ```
+ cat ${MLP_ENVIRONMENT_FILE} && \
+ source ${MLP_ENVIRONMENT_FILE}
+ ```
+
+ > You should see the various variables populated with the information specific to your environment.
+
+### Vertex AI variables
+
+- Set `REGION` to the Google Cloud region to use for the Vertex AI API calls
+
+ ```
+ REGION=us-central1
+ ```
+
+## Build the container image
+
+- Build the container image using Cloud Build and push the image to Artifact Registry
+
+ ```
+ cd src
+ sed -i -e "s|^serviceAccount:.*|serviceAccount: projects/${MLP_PROJECT_ID}/serviceAccounts/${MLP_BUILD_GSA}|" cloudbuild.yaml
+ gcloud beta builds submit --config cloudbuild.yaml \
+ --project ${MLP_PROJECT_ID} \
+ --substitutions _DESTINATION=${MLP_DATA_PREPARATION_IMAGE}
+ cd ..
+ ```
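+
+- Optionally, verify that the image is now available in Artifact Registry
+
+ ```sh
+ gcloud artifacts docker images describe ${MLP_DATA_PREPARATION_IMAGE} --project ${MLP_PROJECT_ID}
+ ```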
+
+## Run the job
+
+- Get credentials for the GKE cluster
+
+ ```sh
+ gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}
+ ```
+
+- Configure the job
+
+ | Variable | Description | Example |
+ | ------------------- | ------------------------------------------------------------------------------------------------------------- | ----------------------------- |
+ | DATASET_INPUT_PATH | The folder path where the preprocessed Flipkart data resides | flipkart_preprocessed_dataset |
+ | DATASET_INPUT_FILE | The filename of the preprocessed Flipkart data | flipkart.csv |
+ | DATASET_OUTPUT_PATH | The folder path where the generated output dataset will reside. This path will be needed for fine-tuning. | dataset/output |
+ | PROMPT_MODEL_ID | The Vertex AI model used for prompt generation | gemini-1.5-flash-001 |
+
+ ```sh
+ DATASET_INPUT_PATH="flipkart_preprocessed_dataset"
+ DATASET_INPUT_FILE="flipkart.csv"
+ DATASET_OUTPUT_PATH="dataset/output"
+ PROMPT_MODEL_ID="gemini-1.5-flash-001"
+ ```
+
+ ```sh
+ sed \
+ -i -e "s|V_IMAGE_URL|${MLP_DATA_PREPARATION_IMAGE}|" \
+ -i -e "s|V_KSA|${MLP_DATA_PREPARATION_KSA}|" \
+ -i -e "s|V_PROJECT_ID|${MLP_PROJECT_ID}|" \
+ -i -e "s|V_DATA_BUCKET|${MLP_DATA_BUCKET}|" \
+ -i -e "s|V_DATASET_INPUT_PATH|${DATASET_INPUT_PATH}|" \
+ -i -e "s|V_DATASET_INPUT_FILE|${DATASET_INPUT_FILE}|" \
+ -i -e "s|V_DATASET_OUTPUT_PATH|${DATASET_OUTPUT_PATH}|" \
+ -i -e "s|V_PROMPT_MODEL_ID|${PROMPT_MODEL_ID}|" \
+ -i -e "s|V_REGION|${REGION}|" \
+ manifests/job.yaml
+ ```
+
+- Create the job
+
+ ```sh
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/job.yaml
+ ```
+
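+- Optionally, wait for the job to complete before checking the output. For example, assuming the job name `data-prep` from `manifests/job.yaml` and a timeout long enough to cover the expected runtime:
+
+ ```sh
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} wait --for=condition=complete --timeout=7200s job/data-prep
+ ```
+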
+- Once the Job is completed, the prepared datasets are stored in Google Cloud Storage.
+
+ ```sh
+ gcloud storage ls gs://${MLP_DATA_BUCKET}/${DATASET_OUTPUT_PATH}
+ ```
+
+## Observability
+
+By default, both GKE and the workloads you run expose metrics and logs in Google Cloud's Observability suite. You can view this information from the Cloud Observability console or the GKE Observability page.
+
+For more information about infrastructure and application metrics, see [View observability metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/view-observability-metrics).
+
+You may want to perform the following tasks specifically for the data preparation use case described in this example.
+
+### Monitor the job
+
+In the Google Cloud console, go to the [Kubernetes Engine](https://console.cloud.google.com/kubernetes) page. Under the `Resource Management` menu on the left side, click `Workloads`. From there, you can filter the workloads by cluster name and namespace. The `Observability` tab provides system-level metric views such as `Overview`, `CPU`, and `Memory`. If you click the job name, for example `data-prep`, you can see job details similar to the following page:
+
+![monitor-job](../../../../docs/images/monitor-job.png)
+
+At the bottom of the page, you can see the status of the pods managed by the job. If your job is having trouble running, the `EVENTS` and `LOGS` tabs will provide more insight. You can also adjust the time window or open the `Container logs` and `Audit logs` for additional information.
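+
+If you prefer the command line, you can get similar information with `kubectl`. For example, assuming the job name `data-prep` and the `app: data-prep` label from `manifests/job.yaml`:
+
+```sh
+kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} get job data-prep
+kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} get pods --selector app=data-prep
+kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} logs --selector app=data-prep --tail=20
+```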
+
+### View the logs
+
+To gain insight into your workload quickly, you can filter and tweak the log queries to view only the relevant logs. You can do so in the `Logs Explorer`. One fast way to open the Logs Explorer and have the query pre-populated is to click the `View in Logs Explorer` button on the right side of the `LOGS` tab once you are on the `Job details` page.
+
+When the link is opened, you should see something like the following:
+
+![log-explorer-query](../../../../docs/images/log-explorer-query.png)
+
+The Logs Explorer offers a number of useful features beyond tweaking the log query in the `Query` field. For example, if you want to know which steps the job has completed, you can run the following query, based on [the source code](src/dataprep.py#L318):
+
+```shell
+resource.type="k8s_container"
+resource.labels.location="us-central1"
+resource.labels.namespace_name="ml-team"
+jsonPayload.message = (
+"***Job Start***" OR
+"Configure signal handlers" OR
+"Prepare context for Gemini Flash's prompt" OR
+"Generate Q & A according" OR
+"Generate Prompts for Gemma IT model" OR
+"Upload prepared dataset into GCS" OR
+"***Job End***")
+```
+
+As another example, if you want to know how many prompts were generated in a specific time window, you can do the following:
+
+- From the code, look for the log entries that are associated with prompt generation. In this example, the `Content generated` log entry is produced each time a prompt is generated.
+- Click `Similar entries`, which automatically updates the log query for you and lists all `Content generated` entries; the resulting query looks similar to the example below.
+- Adjust the timeline in the middle of the page and zoom in or out. You will see how many log entries are ingested during a specific time window, such as 30 seconds. That number should match the number of prompts generated by the code.
+
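+The resulting query is similar to the following hypothetical example, based on the `Content generated` log entry in [the source code](src/dataprep.py):
+
+```shell
+resource.type="k8s_container"
+resource.labels.location="us-central1"
+resource.labels.namespace_name="ml-team"
+jsonPayload.message="Content generated"
+```
+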
+### Log Analytics
+
+You can also use [Log Analytics](https://cloud.google.com/logging/docs/analyze/query-and-view) to analyze your logs. After it is enabled, you can run SQL queries to gain insight from the logs, and the results can be charted. For example, you can click the `Analyze results` link on the Logs Explorer page to open the Log Analytics page with a converted SQL query. The chart and table you view can also be added to a dashboard.
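+
+For example, a query along the following lines, a sketch that assumes the `_Default` log bucket has been upgraded to use Log Analytics and that you substitute your own project ID, counts the prompts generated per minute:
+
+```sql
+SELECT
+  TIMESTAMP_TRUNC(timestamp, MINUTE) AS minute,
+  COUNT(*) AS prompts_generated
+FROM
+  `PROJECT_ID.global._Default._AllLogs`
+WHERE
+  JSON_VALUE(json_payload.message) = 'Content generated'
+GROUP BY minute
+ORDER BY minute
+```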
+
+![log-analytics](../../../../docs/images/log-analytics.png)
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/manifests/job.yaml b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/manifests/job.yaml
new file mode 100644
index 000000000..edb5baece
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/manifests/job.yaml
@@ -0,0 +1,45 @@
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: data-prep
+spec:
+ backoffLimit: 0
+ template:
+ metadata:
+ labels:
+ app: data-prep
+ spec:
+ containers:
+ - name: job
+ image: V_IMAGE_URL
+ imagePullPolicy: Always
+ env:
+ - name: "BUCKET"
+ value: "V_DATA_BUCKET"
+ - name: "DATASET_INPUT_PATH"
+ value: "V_DATASET_INPUT_PATH"
+ - name: "DATASET_INPUT_FILE"
+ value: "V_DATASET_INPUT_FILE"
+ - name: "DATASET_OUTPUT_PATH"
+ value: "V_DATASET_OUTPUT_PATH"
+ - name: "PROJECT_ID"
+ value: "V_PROJECT_ID"
+ - name: "PROMPT_MODEL_ID"
+ value: "V_PROMPT_MODEL_ID"
+ - name: "REGION"
+ value: "V_REGION"
+ resources:
+ requests:
+ cpu: 100m
+ memory: 512Mi
+ limits:
+ cpu: 250m
+ memory: 1Gi
+ nodeSelector:
+ resource-type: "cpu"
+ restartPolicy: Never
+ serviceAccountName: V_KSA
+ tolerations:
+ - key: "on-demand"
+ operator: "Exists"
+ effect: "NoSchedule"
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/.gcloudignore b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/.gcloudignore
new file mode 100644
index 000000000..d66964de3
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/.gcloudignore
@@ -0,0 +1,7 @@
+__pycache__/
+.venv/
+venv/
+
+.gcloudignore
+.python-version
+cloudbuild.yaml
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/Dockerfile b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/Dockerfile
new file mode 100644
index 000000000..c71bffbc7
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/Dockerfile
@@ -0,0 +1,15 @@
+FROM python:3.12.4-slim
+
+RUN apt-get update && \
+ apt-get -y --no-install-recommends install python3-dev gcc python3-pip git && \
+ rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt /
+
+RUN pip3 install --no-cache-dir -r /requirements.txt
+
+COPY custom_json_formatter.py dataprep.py logging.conf /
+
+ENV PYTHONUNBUFFERED 1
+
+CMD python3 /dataprep.py
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/cloudbuild.yaml b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/cloudbuild.yaml
new file mode 100644
index 000000000..f5a8a5e4a
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/cloudbuild.yaml
@@ -0,0 +1,12 @@
+images:
+ - ${_DESTINATION}
+options:
+ logging: CLOUD_LOGGING_ONLY
+serviceAccount:
+steps:
+- name: 'gcr.io/cloud-builders/docker'
+ args:
+ - build
+ - -t
+ - ${_DESTINATION}
+ - .
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/custom_json_formatter.py b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/custom_json_formatter.py
new file mode 100644
index 000000000..0160054ec
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/custom_json_formatter.py
@@ -0,0 +1,91 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import logging
+
+
+class CustomJSONFormatter(logging.Formatter):
+ def __init__(
+ self,
+ fmt=None,
+ datefmt=None,
+ style="%",
+ validate=True,
+ defaults=None,
+ ):
+ super().__init__(
+ fmt=fmt,
+ datefmt=datefmt,
+ style=style,
+ validate=validate,
+ defaults=defaults,
+ )
+
+ self._ignore_keys = {
+ "args",
+ "levelno",
+ "msecs",
+ "msg",
+ "pathname",
+ }
+
+ self._rename_keys = {
+ "created": "timestamp",
+ "levelname": "level",
+ "relativeCreated": "runtime",
+ }
+
+ self._debug_keys = {
+ "filename",
+ "funcName",
+ "lineno",
+ "module",
+ }
+
+ # LogRecord object: https://docs.python.org/3/library/logging.html#logrecord-attributes
+ def format(self, record: logging.LogRecord) -> str:
+ entry = record.__dict__.copy()
+ entry["message"] = record.getMessage()
+
+ # Remove ignored keys
+ for key in self._ignore_keys:
+ entry.pop(key, None)
+
+ # Remove keys based on log level
+ if record.levelno > logging.DEBUG:
+ for key in self._debug_keys:
+ entry.pop(key, None)
+
+ # Rename keys
+ for key, value in list(self._rename_keys.items()):
+ if key in entry:
+ entry[value] = entry.pop(key)
+
+ if record.exc_info:
+ if not record.exc_text:
+ record.exc_text = self.formatException(record.exc_info)
+
+ if record.exc_text:
+ entry["exc_info"] = record.exc_text
+
+ if record.stack_info:
+ entry["stack_info"] = self.formatStack(record.stack_info)
+
+ # Delete keys with None value
+ for key, value in list(entry.items()):
+ if value is None:
+ del entry[key]
+
+ return json.dumps(entry)
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/dataprep.py b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/dataprep.py
new file mode 100644
index 000000000..50b8c1913
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/dataprep.py
@@ -0,0 +1,336 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import datasets
+import json
+import logging
+import logging.config
+import numpy as np
+import os
+import pandas as pd
+import re
+import signal
+import sys
+import tenacity
+import time
+import vertexai
+import vertexai.preview.generative_models as generative_models
+
+from datasets import Dataset, DatasetDict
+from google.api_core.exceptions import InternalServerError, ResourceExhausted
+from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential
+from vertexai.preview.generative_models import GenerativeModel
+
+from custom_json_formatter import CustomJSONFormatter
+
+
+PROJECT_ID = os.environ.get("PROJECT_ID")
+# The bucket which contains the preprocessed data
+BUCKET = os.environ.get("BUCKET")
+REGION = os.environ.get("REGION")
+DATASET_INPUT = os.environ.get("DATASET_INPUT_PATH")
+DATASET_INPUT_FILE = os.environ.get("DATASET_INPUT_FILE")
+DATASET_OUTPUT = os.environ.get("DATASET_OUTPUT_PATH")
+MODEL_ID = os.environ.get("PROMPT_MODEL_ID")
+
+generation_config = {"max_output_tokens": 200, "temperature": 0.7}
+
+safety_settings = {
+ generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
+ generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
+ generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
+ generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_ONLY_HIGH,
+}
+
+num_questions = 3
+
+vertexai.init(project=PROJECT_ID, location=REGION)
+model = GenerativeModel(MODEL_ID)
+
+def filter_low_value_count_rows(df, column_name, min_count=10):
+ """
+ Removes rows from a DataFrame where the value count in the specified column is less than the given minimum count.
+
+ Args:
+ df: The Pandas DataFrame to filter.
+ column_name: The name of the column to check value counts for.
+ min_count: The minimum value count required for a row to be kept (default: 10).
+
+ Returns:
+ A new DataFrame with rows removed where value counts are below the threshold.
+ """
+
+ # Calculate value counts for the specified column
+ value_counts = df[column_name].value_counts()
+
+ # Filter values that meet the minimum count criteria
+ filtered_values = value_counts[value_counts >= min_count].index
+
+ # Create a new DataFrame keeping only rows with those values
+ filtered_df = df[df[column_name].isin(filtered_values)]
+
+ return filtered_df
+
+
+def prep_context():
+ preprocessed = pd.read_csv(f"gs://{BUCKET}/{DATASET_INPUT}/{DATASET_INPUT_FILE}")
+ # renaming column name
+ preprocessed.rename(
+ columns={
+ "uniq_id": "Id",
+ "product_name": "Name",
+ "description": "Description",
+ "brand": "Brand",
+ "attributes": "Specifications",
+ },
+ inplace=True,
+ )
+ df = preprocessed[
+ [
+ "Name",
+ "Description",
+ "Specifications",
+ "Brand",
+ "c0_name",
+ "c1_name",
+ "c2_name",
+ "c3_name",
+ ]
+ ]
+
+ # Filter only clothing products
+ filtered_df = df[df["c0_name"] == "Clothing"]
+
+ # Filter only Women, Men & Kids clothing products
+ values_to_filter = ["Women's Clothing", "Men's Clothing", "Kids' Clothing"]
+ clothing_filtered_df = filtered_df[filtered_df["c1_name"].isin(values_to_filter)]
+
+ # Filter to keep rows where 'c2_name' has count >=10
+ c2_filtered_df = filter_low_value_count_rows(
+ clothing_filtered_df, "c2_name", min_count=10
+ )
+
+ # Filter to keep rows where 'c3_name' has count >=10
+ c3_filtered_df = filter_low_value_count_rows(
+ c2_filtered_df, "c3_name", min_count=10
+ )
+
+ # Data Format expected for finetuning: {"context": " ", "question": " ", "answer": " "}
+ context_df = c3_filtered_df[["Name", "Description", "c1_name", "Specifications"]]
+ finetune_ds = pd.DataFrame(columns=["context", "question", "answer"])
+ finetune_ds["context"] = (
+ "Product Name: "
+ + context_df["Name"]
+ + " Product Category: "
+ + context_df["c1_name"]
+ + " Attributes: "
+ + context_df["Specifications"]
+ + " Description: "
+ + context_df["Description"]
+ )
+
+ finetune_ds["c1_name"] = context_df["c1_name"]
+ return finetune_ds
+
+
+def extract_product_details(text):
+ # Initialize empty string
+ output_string = ""
+
+ # Extract content before "Description:"
+ match = re.search(r"(.*?)Description:", text, re.DOTALL)
+ if match:
+ content_before_description = match.group(1)
+
+ # Remove tags and "Product Category:" line
+ cleaned_content = content_before_description.replace(" ", "\n")
+ lines = [
+ line.strip()
+ for line in cleaned_content.splitlines()
+ if line.strip() and not line.startswith("Product Category:")
+ ]
+
+ # Extract and parse attributes
+ match_attributes = re.search(
+ r"Attributes:\s*(\{.*?\})", cleaned_content, re.DOTALL
+ )
+ if match_attributes:
+ attributes_str = match_attributes.group(1)
+ attributes = json.loads(attributes_str)
+
+ # Append formatted output to output_string
+ for line in lines:
+ if not line.startswith("Attributes:"):
+ output_string += line + "\n"
+ output_string += "Product Details:\n"
+ for key, value in attributes.items():
+ output_string += f"- {key}: {value}\n"
+
+ # Return the final string
+ return output_string
+
+@retry(
+ retry=(
+ retry_if_exception_type(InternalServerError) |
+ retry_if_exception_type(ResourceExhausted)
+ ),
+ stop=stop_after_attempt(10),
+ wait=wait_random_exponential(exp_base=3, max=60, multiplier=1)
+)
+def generate_content(context):
+ try:
+ response = model.generate_content(
+ [f"Generate {num_questions} Search Queries in conversational tone and Answers for this product:\n{context}. Return the result without any formatting in a single line as Question : Answer ;"],
+ generation_config=generation_config,
+ safety_settings=safety_settings,
+ )
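+ # Accessing response.text raises a ValueError when the response was blocked
+ # (for example by safety filters or recitation); this is handled in the except clauses below.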
+ response.text
+ except InternalServerError as e:
+ logger.warning(f"InternalServerError exception caught: {e}")
+ raise
+ except ResourceExhausted as e:
+ logger.warning(e)
+ raise
+ except ValueError as e:
+ logger.debug("ValueError exception caught")
+ if response.candidates[0].finish_reason == generative_models.FinishReason.RECITATION:
+ logger.warning(f"Recitation returned", extra={"context": context, "response": str(response)})
+ return None
+ elif response.candidates[0].finish_reason == generative_models.FinishReason.SAFETY:
+ logger.warning(f"Blocked by safety settings", extra={"context": context, "response": str(response)})
+ return None
+ else:
+ logger.error(f"Unhandled ValueError", extra={"context": context}, exc_info=True)
+ raise
+ except Exception as e:
+ logger.error(f"Unhandled exception in generate_content: {type(e).__name__}", exc_info=True)
+ raise
+
+ return response.text
+
+
+def generate_qa(context, category):
+ try:
+ qa = generate_content(context)
+ except tenacity.RetryError as e:
+ logger.error(f"Exception: {e}, failed to generate content for context: {context}")
+ return None
+ except Exception as e:
+ logger.error(f"Unhandled exception from generate_content: {type(e).__name__}", exc_info=True)
+ raise
+
+ if qa is None:
+ return None
+
+ # Create a DataFrame
+ temp_df = pd.DataFrame(columns=["Question", "Answer", "Context"])
+ qa_list = qa.split(";")
+
+ # Create a list to hold the data
+ new_data = []
+
+ # Iterate over the QA items
+ for qa_item in qa_list:
+ q_a = qa_item.split(":")
+ if len(q_a) == 2:
+ ans = q_a[1].strip() + " \n " + extract_product_details(context)
+ # Append as the list
+ new_data.append([q_a[0].strip(), ans, f"Online shopping for {category}"])
+
+ # Create the DataFrame after collecting all data
+ temp_df = pd.DataFrame(new_data, columns=temp_df.columns)
+
+ return temp_df
+
+
+def generate_prompt(row):
+ return f"user\nContext:{row['Context']}\n{row['Question']}model\n{row['Answer']}"
+
+
+def data_prep(finetune_ds):
+ result = pd.DataFrame()
+ for context, category in zip(finetune_ds["context"], finetune_ds["c1_name"]):
+ # Skip rows with a missing (NaN) context
+ if pd.notna(context):
+ temp_df = generate_qa(context, category)
+ if temp_df is not None:
+ result = pd.concat([result, temp_df], ignore_index=True)
+ logger.info(f"Content generated", extra={"context": context})
+ # Now `result` contains all generated questions and answers
+ return result
+
+
+def train_validate_test_split(df):
+ logger.info(f"Total Data Size: {len(df)}")
+ train_size = int(0.8 * len(df))
+ val_size = int(0.1 * len(df))
+ train_df = df.sample(n=train_size, random_state=42)
+ remaining_df = df.drop(train_df.index)
+ val_df = remaining_df.sample(n=val_size, random_state=42)
+ test_df = remaining_df.drop(val_df.index)
+ logger.info(f"Training data size: {len(train_df)}, Validation data size: {len(val_df)}, Test data size: {len(test_df)}")
+ # Create DatasetDict with splits
+ dataset = DatasetDict(
+ {
+ "train": Dataset.from_pandas(train_df),
+ "validation": Dataset.from_pandas(val_df),
+ "test": Dataset.from_pandas(test_df),
+ }
+ )
+ dataset["train"].save_to_disk(f"gs://{BUCKET}/{DATASET_OUTPUT}/training/")
+ dataset["validation"].save_to_disk(f"gs://{BUCKET}/{DATASET_OUTPUT}/validation/")
+ dataset["test"].save_to_disk(f"gs://{BUCKET}/{DATASET_OUTPUT}/test/")
+
+
+def graceful_shutdown(signal_number, stack_frame):
+ signal_name = signal.Signals(signal_number).name
+
+ logger.info(f"Received {signal_name}({signal_number}), shutting down...")
+ # TODO: Add logic to handle checkpointing if required
+ sys.exit(0)
+
+if __name__ == "__main__":
+ # Configure logging
+ logging.config.fileConfig("logging.conf")
+
+ logger = logging.getLogger("processing")
+
+ if "LOG_LEVEL" in os.environ:
+ new_log_level = os.environ["LOG_LEVEL"].upper()
+ logger.info(
+ f"Log level set to '{new_log_level}' via LOG_LEVEL environment variable"
+ )
+ logging.getLogger().setLevel(new_log_level)
+ logger.setLevel(new_log_level)
+
+ datasets.disable_progress_bar()
+
+ logger.info("***Job Start***")
+
+ logger.info("Configure signal handlers")
+ signal.signal(signal.SIGINT, graceful_shutdown)
+ signal.signal(signal.SIGTERM, graceful_shutdown)
+
+ logger.info("Prepare context for Gemini Flash's prompt")
+ df = prep_context()
+
+ logger.info("Generate Q & A according")
+ res_df = data_prep(df)
+
+ logger.info("Generate Prompts for Gemma IT model")
+ res_df["prompt"] = res_df.apply(generate_prompt, axis=1)
+
+ logger.info("Upload prepared dataset into GCS")
+ train_validate_test_split(res_df)
+
+ logger.info("***Job End***")
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/logging.conf b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/logging.conf
new file mode 100644
index 000000000..7dee5bee5
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/logging.conf
@@ -0,0 +1,24 @@
+[formatters]
+keys=custom_json
+
+[formatter_custom_json]
+class=custom_json_formatter.CustomJSONFormatter
+
+[handlers]
+keys=console
+
+[handler_console]
+class=logging.StreamHandler
+args=(sys.stdout,)
+formatter=custom_json
+
+[loggers]
+keys=root
+
+[logger_root]
+level=INFO
+handlers=console
+
+[logger_processing]
+level=INFO
+handlers=console
diff --git a/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/requirements.txt b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/requirements.txt
new file mode 100644
index 000000000..1c04ac207
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-preparation/gemma-it/src/requirements.txt
@@ -0,0 +1,10 @@
+datasets==2.20.0
+gcsfs==2024.5.0
+google-cloud-aiplatform==1.49.0
+google-cloud-storage==2.17.0
+jsonpickle==3.2.2
+numpy==1.26.4
+pandas==2.2.2
+protobuf==4.24.4 ## lowered version
+tenacity==9.0.0
+vertexai==1.49.0
diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/CONVERSION.md b/best-practices/ml-platform/examples/use-case/data-processing/ray/CONVERSION.md
similarity index 100%
rename from best-practices/ml-platform/examples/use-case/ray/dataprocessing/CONVERSION.md
rename to best-practices/ml-platform/examples/use-case/data-processing/ray/CONVERSION.md
diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/DEVELOPER.md b/best-practices/ml-platform/examples/use-case/data-processing/ray/DEVELOPER.md
similarity index 93%
rename from best-practices/ml-platform/examples/use-case/ray/dataprocessing/DEVELOPER.md
rename to best-practices/ml-platform/examples/use-case/data-processing/ray/DEVELOPER.md
index 75f281546..74bbb0463 100644
--- a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/DEVELOPER.md
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/DEVELOPER.md
@@ -18,7 +18,7 @@
- Change directory to the `src` directory
```
- cd best-practices/ml-platform/examples/use-case/ray/dataprocessing/src
+ cd best-practices/ml-platform/examples/use-case/data-processing/ray/src
```
- Set the local `python` version
diff --git a/best-practices/ml-platform/examples/use-case/data-processing/ray/README.md b/best-practices/ml-platform/examples/use-case/data-processing/ray/README.md
new file mode 100644
index 000000000..11d65d5b2
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/README.md
@@ -0,0 +1,240 @@
+# Distributed Data Processing with Ray on GKE
+
+## Dataset
+
+[This](https://www.kaggle.com/datasets/PromptCloudHQ/flipkart-products) is a pre-crawled public dataset, taken as a subset of a bigger dataset (more than 5.8 million products) that was created by extracting data from [Flipkart](https://www.flipkart.com/), a leading Indian eCommerce store.
+
+## Architecture
+
+![data-processing](/best-practices/ml-platform/docs/images/use-case/data-processing-ray-workflow.png)
+
+## Data processing steps
+
+The dataset has product information such as id, name, brand, description, image URLs, and product specifications.
+
+The `preprocessing.py` file does the following:
+
+- Reads the CSV from Cloud Storage
+- Cleans up the product description text
+- Extracts the image URLs, validates them, and downloads the images into Cloud Storage
+- Cleans up and extracts the product attributes as key-value pairs
+
+The data processing step takes approximately 18-20 minutes.
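+
+To make the attribute extraction step concrete, the following sketch shows the kind of transformation applied to the `product_specifications` column. The sample specification string is hypothetical and the parsing is intentionally simplified compared to the actual implementation in `preprocessing.py`:
+
+```python
+import json
+import re
+
+# Hypothetical product_specifications value, shaped like the Flipkart export
+spec = '{"product_specification"=>[{"key"=>"Type", "value"=>"Running Shoes"}, {"key"=>"Color", "value"=>"Blue"}]}'
+
+# Simplified extraction of the key/value pairs into a flat dictionary
+pairs = re.findall(r'"key"=>"(.*?)", "value"=>"(.*?)"', spec)
+attributes = {key: value for key, value in pairs}
+
+print(json.dumps(attributes))  # {"Type": "Running Shoes", "Color": "Blue"}
+```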
+
+## Prerequisites
+
+- This guide was developed to be run on the [playground machine learning platform](/best-practices/ml-platform/examples/platform/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
+
+## Preparation
+
+- Clone the repository and change directory to the guide directory
+
+ ```shell
+ git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
+ cd ai-on-gke/best-practices/ml-platform/examples/use-case/data-processing/ray
+ ```
+
+- Ensure that your `MLP_ENVIRONMENT_FILE` is configured
+
+ ```shell
+ cat ${MLP_ENVIRONMENT_FILE} && \
+ source ${MLP_ENVIRONMENT_FILE}
+ ```
+
+ > You should see the various variables populated with the information specific to your environment.
+
+## Configuration
+
+- Download the raw data CSV file from [Kaggle](https://kaggle.com) and store it in the `MLP_DATA_BUCKET` bucket.
+
+ - You will need the Kaggle CLI to download the file. The Kaggle CLI can be installed using the following command in Cloud Shell:
+ ```shell
+ pip3 install --user kaggle
+ ```
+ For more details, see the [installation instructions](https://github.com/Kaggle/kaggle-api#installation).
+ - To use the CLI you must create an API token. Register on [kaggle.com](https://kaggle.com) if you don't already have an account, then go to `kaggle.com/settings > API > Create New Token`. Store the downloaded file as `$HOME/.kaggle/kaggle.json`; note that you will have to create the `$HOME/.kaggle` directory yourself. After the configuration is done, you can run the following command to download the dataset and copy it to the GCS bucket:
+ ```shell
+ kaggle datasets download --unzip atharvjairath/flipkart-ecommerce-dataset && \
+ gcloud storage cp flipkart_com-ecommerce_sample.csv \
+ gs://${MLP_DATA_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv && \
+ rm flipkart_com-ecommerce_sample.csv
+ ```
+ - Alternatively, you can [download the dataset](https://www.kaggle.com/datasets/atharvjairath/flipkart-ecommerce-dataset) directly from the Kaggle website and copy it to the bucket.
+
+## Build the container image
+
+- Build container image using Cloud Build and push the image to Artifact Registry
+
+ ```shell
+ cd src
+ sed -i -e "s|^serviceAccount:.*|serviceAccount: projects/${MLP_PROJECT_ID}/serviceAccounts/${MLP_BUILD_GSA}|" cloudbuild.yaml
+ gcloud beta builds submit --config cloudbuild.yaml \
+ --project ${MLP_PROJECT_ID} \
+ --substitutions _DESTINATION=${MLP_DATA_PROCESSING_IMAGE}
+ cd ..
+ ```
+
+## Run the job
+
+- Get credentials for the GKE cluster
+
+ ```shell
+ gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}
+ ```
+
+- Configure the job
+
+ ```shell
+ sed \
+ -i -e "s|V_DATA_BUCKET|${MLP_DATA_BUCKET}|" \
+ -i -e "s|V_IMAGE_URL|${MLP_DATA_PROCESSING_IMAGE}|" \
+ -i -e "s|V_KSA|${MLP_DATA_PROCESSING_KSA}|" \
+ manifests/job.yaml
+ ```
+
+- Create the job
+
+ ```shell
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/job.yaml
+ ```
+
+- Monitor the execution in Ray Dashboard. You can run the following command to get the dashboard endpoint:
+ ```shell
+ echo -e "\n${MLP_KUBERNETES_NAMESPACE} Ray dashboard: ${MLP_RAY_DASHBOARD_NAMESPACE_ENDPOINT}\n"
+ ```
+ Read [the section about KubeRay](../../../platform/playground/README.md#software-installed-via-reposync-and-rootsync) for more info.
+- From the Ray Dashboard, view the following about the jobs:
+ - Jobs -> Running Job ID
+ - See the Tasks/actors overview for Running jobs
+ - See the Task Table for a detailed view of task and assigned node(s)
+ - Cluster -> Node List
+ - See the Ray actors running on the worker process
+
+- You can check the job status from the GKE console or [query the logs](#log-query-sample) in the [Logs Explorer](https://console.cloud.google.com/logs). Once the job has completed, both the prepared dataset (as a CSV file) and the images are stored in Google Cloud Storage.
+
+ ```shell
+ gcloud storage ls gs://${MLP_DATA_BUCKET}/flipkart_preprocessed_dataset/flipkart.csv
+ gcloud storage ls gs://${MLP_DATA_BUCKET}/flipkart_images
+ ```
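+
+ Optionally, you can inspect the processed CSV directly from Python. This is a minimal sketch, assuming `pandas` and `gcsfs` are installed locally and `MLP_DATA_BUCKET` is set in your environment:
+
+ ```python
+ import os
+
+ import pandas as pd
+
+ bucket = os.environ["MLP_DATA_BUCKET"]
+ df = pd.read_csv(f"gs://{bucket}/flipkart_preprocessed_dataset/flipkart.csv")
+
+ # The processed rows include the cleaned description, the extracted
+ # attributes, and the GCS image URI for each product
+ print(df.shape)
+ print(df[["uniq_id", "description", "attributes", "image_uri"]].head())
+ ```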
+
+> For additional information about developing using this codebase, see the [Developer Guide](DEVELOPER.md)
+
+> For additional information about converting your code from a notebook to run as a Job on GKE, see the [Conversion Guide](CONVERSION.md)
+
+## Observability
+
+By default, both GKE and the workloads you run expose metrics and logs in Google Cloud's Observability suite. You can view that information either from the Cloud Observability console or the GKE Observability page.
+
+For more information about infrastructure and application metrics, see [View observability metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/view-observability-metrics).
+
+Specifically for the data processing use case described in this example, you can perform additional analysis based on the workload logs.
+
+### Log query sample
+
+In the Google Cloud console, go to the [Logs Explorer](https://console.cloud.google.com/logs) page to run your queries.
+
+- Find when the data processing job started and finished. You may need to adjust the time window in the UI or use [timestamp](https://cloud.google.com/logging/docs/view/logging-query-language) in the query:
+
+ ```
+ labels."k8s-pod/app"="data-processing"
+ resource.type="k8s_container"
+ jsonPayload.message: "Started" OR jsonPayload.message: "Finished"
+ severity=INFO
+ ```
+
+- Find all error logs for the job:
+
+ ```
+ labels."k8s-pod/app"="data-processing"
+ resource.type="k8s_container"
+ severity=ERROR
+ ```
+
+- Search for specific errors in the `textPayload` using a regular expression:
+
+ ```
+ labels."k8s-pod/app"="data-processing"
+ resource.type="k8s_container"
+ textPayload =~ "ray_worker_node_id.+Image.+not found$"
+ severity=ERROR
+ ```
+
+You can narrow down the results by adding extra filters, such as using additional labels. For more GKE query samples, you can read [Kubernetes-related queries](https://cloud.google.com/logging/docs/view/query-library#kubernetes-filters).
+
+### Log-based Metrics
+
+To gain insight into your workload status, you can also utilize [log-based metrics](https://cloud.google.com/logging/docs/logs-based-metrics). There are several ways to create them. The most straightforward approach is to refine your log query until it locates the relevant log entries, then click the `Create metric` link and define the custom metric to suit your requirements. For example:
+
+![log-based-metrics](../../../../docs/images/create-log-based-metrics.png)
+
+For this example, the following query is used, utilizing a more specific regular expression to search the error logs. With the log entries found, you can create log-based metrics.
+
+```
+labels."k8s-pod/app"="data-processing"
+resource.type="k8s_container"
+textPayload =~ "ray_worker_node_id.+Image.+not found$"
+severity=ERROR
+```
+
+The following is a definition for a metric such as `No_Image_Found_Product`. Notice that both the GKE node and the Ray worker node id are added as labels.
+
+```yaml
+filter: |-
+ labels."k8s-pod/app"="data-processing"
+ resource.type="k8s_container"
+ textPayload =~ "ray_worker_node_id.+Image.+not found$"
+ severity=ERROR
+labelExtractors:
+ gke_node: EXTRACT(labels."compute.googleapis.com/resource_name")
+ ray_worker_node_id: REGEXP_EXTRACT(textPayload, "ray_worker_node_id:(.+) Image")
+metricDescriptor:
+ labels:
+ - key: gke_node
+ - key: ray_worker_node_id
+ metricKind: DELTA
+ name: projects/xxxxx/metricDescriptors/logging.googleapis.com/user/No_Image_Found_Product
+ type: logging.googleapis.com/user/No_Image_Found_Product
+ unit: "1"
+ valueType: INT64
+name: No_Image_Found_Product
+resourceName: projects/xxxxx/metrics/No_Image_Found_Product
+```
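+
+If you prefer to create a similar metric programmatically rather than through the console, a minimal sketch using the `google-cloud-logging` client library is shown below. Note that this helper only sets the name, filter, and description, so the label extractors shown above would still need to be added in the console or with `gcloud`:
+
+```python
+from google.cloud import logging
+
+client = logging.Client()  # uses Application Default Credentials
+
+log_filter = """
+labels."k8s-pod/app"="data-processing"
+resource.type="k8s_container"
+textPayload =~ "ray_worker_node_id.+Image.+not found$"
+severity=ERROR
+"""
+
+metric = client.metric(
+    "No_Image_Found_Product",
+    filter_=log_filter,
+    description="Products whose image could not be downloaded",
+)
+if not metric.exists():
+    metric.create()
+```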
+
+Once the metrics are defined, the next time you run your workloads, you will be able to use them. For example, the following chart visualizes the metric defined above:
+
+![use-log-based-metrics](../../../../docs/images/use-log-based-metrics.png)
+
+### Log Analytics
+
+You can also use [Log Analytics](https://cloud.google.com/logging/docs/log-analytics#analytics) to [analyze your logs](https://cloud.google.com/logging/docs/analyze/query-and-view). If your log buckets have not been upgraded for Log Analytics, you need to upgrade them first. After the log buckets are upgraded, you can run SQL queries to gain insight from the newly ingested logs, and the query results can also be charted. For example, the following query returns the `Image not found` errors and charts the result:
+
+```sql
+WITH
+ logs AS (
+ SELECT
+ *
+ FROM
+ `[Your Project Id].global._Default._AllLogs` )
+SELECT
+ timestamp,
+ severity,
+ text_payload,
+ proto_payload,
+ json_payload
+FROM
+ logs
+WHERE
+ SAFE.STRING(logs.labels["k8s-pod/app"]) = "data-processing"
+ AND logs.resource.type= "k8s_container"
+ AND logs.text_payload IS NOT NULL
+ AND REGEXP_CONTAINS(logs.text_payload, "ray_worker_node_id.+Image.+not found$")
+ AND logs.severity = "ERROR"
+ORDER BY
+ timestamp DESC,
+ insert_id DESC
+LIMIT
+ 10000
+```
+
+You should see output like the following:
+![log-analytics-data-processing](../../../../docs/images/log-analytics-data-processing.png)
\ No newline at end of file
diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/job.yaml b/best-practices/ml-platform/examples/use-case/data-processing/ray/manifests/job.yaml
similarity index 62%
rename from best-practices/ml-platform/examples/use-case/ray/dataprocessing/job.yaml
rename to best-practices/ml-platform/examples/use-case/data-processing/ray/manifests/job.yaml
index 666a41f90..3e50a7084 100644
--- a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/job.yaml
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/manifests/job.yaml
@@ -1,28 +1,34 @@
apiVersion: batch/v1
kind: Job
metadata:
- name: job
- namespace: ml-team
+ name: data-processing
spec:
backoffLimit: 0
template:
metadata:
labels:
- app: job
+ app: data-processing
spec:
containers:
- name: job
- image: #IMAGE
+ image: V_IMAGE_URL
imagePullPolicy: Always
env:
- name: "PROCESSING_BUCKET"
- value: #PROCESSING_BUCKET
+ value: V_DATA_BUCKET
- name: "RAY_CLUSTER_HOST"
value: ray-cluster-kuberay-head-svc.ml-team:10001
+ resources:
+ requests:
+ cpu: 100m
+ memory: 256Mi
+ limits:
+ cpu: 250m
+ memory: 512Mi
nodeSelector:
resource-type: cpu
restartPolicy: Never
- serviceAccountName: ray-worker
+ serviceAccountName: V_KSA
tolerations:
- effect: NoSchedule
key: on-demand
diff --git a/best-practices/ml-platform/examples/use-case/data-processing/ray/src/.gcloudignore b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/.gcloudignore
new file mode 100644
index 000000000..d66964de3
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/.gcloudignore
@@ -0,0 +1,7 @@
+__pycache__/
+.venv/
+venv/
+
+.gcloudignore
+.python-version
+cloudbuild.yaml
diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/Dockerfile b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/Dockerfile
similarity index 82%
rename from best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/Dockerfile
rename to best-practices/ml-platform/examples/use-case/data-processing/ray/src/Dockerfile
index 8cf00db90..b9df73583 100644
--- a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/Dockerfile
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/Dockerfile
@@ -12,8 +12,7 @@ COPY requirements.txt /venv/requirements.txt
RUN pip install --no-cache-dir -r /venv/requirements.txt
-COPY logging.conf /app/logging.conf
-COPY preprocessing.py /app/preprocessing.py
+COPY custom_json_formatter.py logging.conf preprocessing.py /app/
WORKDIR /app
diff --git a/best-practices/ml-platform/examples/use-case/data-processing/ray/src/cloudbuild.yaml b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/cloudbuild.yaml
new file mode 100644
index 000000000..f5a8a5e4a
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/cloudbuild.yaml
@@ -0,0 +1,12 @@
+images:
+ - ${_DESTINATION}
+options:
+ logging: CLOUD_LOGGING_ONLY
+serviceAccount:
+steps:
+- name: 'gcr.io/cloud-builders/docker'
+ args:
+ - build
+ - -t
+ - ${_DESTINATION}
+ - .
diff --git a/best-practices/ml-platform/examples/use-case/data-processing/ray/src/custom_json_formatter.py b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/custom_json_formatter.py
new file mode 100644
index 000000000..0160054ec
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/custom_json_formatter.py
@@ -0,0 +1,91 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import logging
+
+
+class CustomJSONFormatter(logging.Formatter):
+ def __init__(
+ self,
+ fmt=None,
+ datefmt=None,
+ style="%",
+ validate=True,
+ defaults=None,
+ ):
+ super().__init__(
+ fmt=fmt,
+ datefmt=datefmt,
+ style=style,
+ validate=validate,
+ defaults=defaults,
+ )
+
+ self._ignore_keys = {
+ "args",
+ "levelno",
+ "msecs",
+ "msg",
+ "pathname",
+ }
+
+ self._rename_keys = {
+ "created": "timestamp",
+ "levelname": "level",
+ "relativeCreated": "runtime",
+ }
+
+ self._debug_keys = {
+ "filename",
+ "funcName",
+ "lineno",
+ "module",
+ }
+
+ # LogRecord object: https://docs.python.org/3/library/logging.html#logrecord-attributes
+ def format(self, record: logging.LogRecord) -> str:
+ entry = record.__dict__.copy()
+ entry["message"] = record.getMessage()
+
+ # Removed ignored keys
+ for key in self._ignore_keys:
+ entry.pop(key, None)
+
+ # Remove keys based on log level
+ if record.levelno > logging.DEBUG:
+ for key in self._debug_keys:
+ entry.pop(key, None)
+
+ # Rename keys
+ for key, value in list(self._rename_keys.items()):
+ if key in entry:
+ entry[value] = entry.pop(key)
+
+ if record.exc_info:
+ if not record.exc_text:
+ record.exc_text = self.formatException(record.exc_info)
+
+ if record.exc_text:
+ entry["exc_info"] = record.exc_text
+
+ if record.stack_info:
+ entry["stack_info"] = self.formatStack(record.stack_info)
+
+ # Delete keys with None value
+ for key, value in list(entry.items()):
+ if value is None:
+ del entry[key]
+
+ return json.dumps(entry)
diff --git a/best-practices/ml-platform/examples/use-case/data-processing/ray/src/logging.conf b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/logging.conf
new file mode 100644
index 000000000..102ee1944
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/logging.conf
@@ -0,0 +1,24 @@
+[formatters]
+keys=custom_json
+
+[formatter_custom_json]
+class=custom_json_formatter.CustomJSONFormatter
+
+[handlers]
+keys=console
+
+[handler_console]
+class=logging.StreamHandler
+args=(sys.stdout,)
+formatter=custom_json
+
+[loggers]
+keys=root
+
+[logger_root]
+level=INFO
+handlers=console
+
+[logger_preprocessing]
+level=INFO
+handlers=console
diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/preprocessing.py b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/preprocessing.py
similarity index 50%
rename from best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/preprocessing.py
rename to best-practices/ml-platform/examples/use-case/data-processing/ray/src/preprocessing.py
index d17a10ce6..c9e8cca44 100644
--- a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/preprocessing.py
+++ b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/preprocessing.py
@@ -1,39 +1,46 @@
import jsonpickle
import logging
+import logging.config
import os
import pandas as pd
import ray
import re
+import signal
import socket
import spacy
-import sys
import time
import urllib.error
import urllib.request
+import numpy as np
from google.cloud import storage
from google.cloud.storage.retry import DEFAULT_RETRY
from typing import List
-IMAGE_BUCKET = os.environ['PROCESSING_BUCKET']
-RAY_CLUSTER_HOST = os.environ['RAY_CLUSTER_HOST']
-GCS_IMAGE_FOLDER = 'flipkart_images'
+IMAGE_BUCKET = os.environ["PROCESSING_BUCKET"]
+RAY_CLUSTER_HOST = os.environ["RAY_CLUSTER_HOST"]
+GCS_IMAGE_FOLDER = "flipkart_images"
+
+
+def graceful_shutdown(signal_number, stack_frame):
+ signal_name = signal.Signals(signal_number).name
+
+ logger.info(f"Received {signal_name}({signal_number}), shutting down...")
+ # TODO: Add logic to handle checkpointing if required
+ sys.exit(0)
+
-logging.config.fileConfig('logging.conf')
-logger = logging.getLogger('preprocessing')
-logger.debug(logger)
@ray.remote(resources={"cpu": 1})
def get_clean_df(df, logger, ray_worker_node_id):
-
# extract image urls
def extract_url(image_list: str) -> List[str]:
- return image_list.replace('[', '').replace(']', '').replace('"', '').split(',')
+ return image_list.replace("[", "").replace("]", "").replace('"', "").split(",")
- #download the image from public url to GCS
+ # download the image from public url to GCS
def download_image(image_url, image_file_name, destination_blob_name, logger):
storage_client = storage.Client()
- download_dir = '/tmp/images'
+ download_dir = "/tmp/images"
try:
if not os.path.exists(download_dir):
os.makedirs(download_dir)
@@ -49,23 +56,37 @@ def download_image(image_url, image_file_name, destination_blob_name, logger):
bucket = storage_client.bucket(IMAGE_BUCKET)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(download_file, retry=DEFAULT_RETRY)
- logger.info(f"ray_worker_node_id:{ray_worker_node_id} File {image_file_name} uploaded to {destination_blob_name}")
+ logger.info(
+ f"ray_worker_node_id:{ray_worker_node_id} File {image_file_name} uploaded to {destination_blob_name}"
+ )
os.remove(download_file)
return True
except TimeoutError as err:
- logger.warning(f"ray_worker_node_id:{ray_worker_node_id} Image '{image_url}' request timeout")
+ logger.warning(
+ f"ray_worker_node_id:{ray_worker_node_id} Image '{image_url}' request timeout"
+ )
except urllib.error.HTTPError as err:
if err.code == 404:
- logger.warning(f"ray_worker_node_id:{ray_worker_node_id} Image '{image_url}' not found")
+ logger.warning(
+ f"ray_worker_node_id:{ray_worker_node_id} Image '{image_url}' not found"
+ )
elif err.code == 504:
- logger.warning(f"ray_worker_node_id:{ray_worker_node_id} Image '{image_url}' gateway timeout")
+ logger.warning(
+ f"ray_worker_node_id:{ray_worker_node_id} Image '{image_url}' gateway timeout"
+ )
else:
- logger.error(f"ray_worker_node_id:{ray_worker_node_id} Unhandled HTTPError exception: {err}")
+ logger.error(
+ f"ray_worker_node_id:{ray_worker_node_id} Unhandled HTTPError exception: {err}"
+ )
except urllib.error.URLError as err:
- logger.error(f"ray_worker_node_id:{ray_worker_node_id} URLError exception: {err}")
+ logger.error(
+ f"ray_worker_node_id:{ray_worker_node_id} URLError exception: {err}"
+ )
except Exception as err:
- logger.error(f"ray_worker_node_id:{ray_worker_node_id} Unhandled exception: {err}")
+ logger.error(
+ f"ray_worker_node_id:{ray_worker_node_id} Unhandled exception: {err}"
+ )
raise
return False
@@ -74,37 +95,42 @@ def download_image(image_url, image_file_name, destination_blob_name, logger):
def prep_product_desc(df, logger):
spacy.cli.download("en_core_web_sm")
model = spacy.load("en_core_web_sm")
+
def parse_nlp_description(description) -> str:
if not pd.isna(description):
try:
doc = model(description.lower())
lemmas = []
for token in doc:
- if token.lemma_ not in lemmas and not token.is_stop and token.is_alpha:
+ if (
+ token.lemma_ not in lemmas
+ and not token.is_stop
+ and token.is_alpha
+ ):
lemmas.append(token.lemma_)
- return ' '.join(lemmas)
+ return " ".join(lemmas)
except:
logger.error("Unable to load spacy model")
- df['description'] = df['description'].apply(parse_nlp_description)
+ df["description"] = df["description"].apply(parse_nlp_description)
return df
# Extract product attributes as key-value pair
def parse_attributes(specification: str):
spec_match_one = re.compile("(.*?)\\[(.*)\\](.*)")
- spec_match_two = re.compile("(.*?)=>\"(.*?)\"(.*?)=>\"(.*?)\"(.*)")
+ spec_match_two = re.compile('(.*?)=>"(.*?)"(.*?)=>"(.*?)"(.*)')
if pd.isna(specification):
return None
m = spec_match_one.match(specification)
out = {}
if m is not None and m.group(2) is not None:
- phrase = ''
+ phrase = ""
for c in m.group(2):
- if c == '}':
+ if c == "}":
m2 = spec_match_two.match(phrase)
if m2 and m2.group(2) is not None and m2.group(4) is not None:
out[m2.group(2)] = m2.group(4)
- phrase = ''
+ phrase = ""
else:
phrase += c
json_string = jsonpickle.encode(out)
@@ -116,7 +142,7 @@ def get_product_image(df, logger):
gcs_image_url = []
image_found_flag = False
- for id, image_list in zip(df['uniq_id'], df['image']):
+ for id, image_list in zip(df["uniq_id"], df["image"]):
if pd.isnull(image_list): # No image url
logger.warning(f"No image url for product {id}")
@@ -130,10 +156,12 @@ def get_product_image(df, logger):
image_file_name = f"{id}_{index}.jpg"
destination_blob_name = f"{GCS_IMAGE_FOLDER}/{id}_{index}.jpg"
image_found_flag = download_image(
- image_url, image_file_name, destination_blob_name, logger)
+ image_url, image_file_name, destination_blob_name, logger
+ )
if image_found_flag:
gcs_image_url.append(
- 'gs://' + IMAGE_BUCKET + '/' + destination_blob_name)
+ "gs://" + IMAGE_BUCKET + "/" + destination_blob_name
+ )
break
if not image_found_flag:
logger.warning(f"No image found for product {id}")
@@ -145,43 +173,88 @@ def get_product_image(df, logger):
gcs_image_loc = pd.DataFrame(gcs_image_url, index=df.index)
gcs_image_loc.columns = ["image_uri"]
df_with_gcs_image_uri = pd.concat([df, gcs_image_loc], axis=1)
- return df_with_gcs_image_uri
+ return df_with_gcs_image_uri
+
+ # Helper function to reformat the given text
+ def reformat(text: str) -> str:
+ if pd.isnull(text):
+ return ""
+ return text.replace("[", "").replace("]", "").replace('"', "")
+
+ def prep_cat(df: pd.DataFrame) -> pd.DataFrame:
+ df["product_category_tree"] = df["product_category_tree"].apply(
+ lambda x: reformat(x)
+ )
+ temp_df = df["product_category_tree"].str.split(">>", expand=True)
+ max_splits = temp_df.shape[1] # Get the number of columns after splitting
+ # Create column names dynamically
+ column_names = [f"c{i}_name" for i in range(max_splits)]
+ temp_df.columns = column_names
+ for col in temp_df.columns:
+ temp_df[col] = temp_df[col].apply(lambda x: x.strip() if x else x)
+ # concatenating df1 and df2 along rows
+ df_with_cat = pd.concat([df, temp_df], axis=1)
+ df_with_cat = df_with_cat.drop("product_category_tree", axis=1)
+ return df_with_cat
df_with_gcs_image_uri = get_product_image(df, logger)
df_with_desc = prep_product_desc(df_with_gcs_image_uri, logger)
- df_with_desc['attributes'] = df_with_desc['product_specifications'].apply(
- parse_attributes)
-
- return df_with_desc
+ df_with_desc["attributes"] = df_with_desc["product_specifications"].apply(
+ parse_attributes
+ )
+ df_with_desc = df_with_desc.drop("product_specifications", axis=1)
+ result_df = prep_cat(df_with_desc)
+ return result_df
def split_dataframe(df, chunk_size=199):
chunks = list()
num_chunks = len(df) // chunk_size + 1
for i in range(num_chunks):
- chunks.append(df[i * chunk_size:(i + 1) * chunk_size])
+ chunks.append(df[i * chunk_size : (i + 1) * chunk_size])
return chunks
# This function invokes ray task
def run_remote():
- #Read raw dataset from GCS
+ # Read raw dataset from GCS
df = pd.read_csv(
- f"gs://{IMAGE_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv")
- df = df[['uniq_id',
- 'product_name',
- 'description',
- 'brand',
- 'image',
- 'product_specifications']]
-
- #Ray runtime env
- runtime_env = {"pip": ["google-cloud-storage==2.16.0",
- "spacy==3.7.4",
- "jsonpickle==3.0.3"],
- "env_vars": {"PIP_NO_CACHE_DIR": "1",
- "PIP_DISABLE_PIP_VERSION_CHECK": "1"}}
+ f"gs://{IMAGE_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv"
+ )
+ df = df[
+ [
+ "uniq_id",
+ "product_name",
+ "description",
+ "brand",
+ "image",
+ "product_specifications",
+ "product_category_tree",
+ ]
+ ]
+ print("Original dataset shape:", df.shape)
+ # Drop rows with null values in specified columns
+ df.dropna(
+ subset=[
+ "description",
+ "image",
+ "product_specifications",
+ "product_category_tree",
+ ],
+ inplace=True,
+ )
+ print("After dropping null values:", df.shape)
+ # Ray runtime env
+ runtime_env = {
+ "pip": [
+ "google-cloud-storage==2.16.0",
+ "spacy==3.7.4",
+ "jsonpickle==3.0.3",
+ "pandas==2.2.1",
+ ],
+ "env_vars": {"PIP_NO_CACHE_DIR": "1", "PIP_DISABLE_PIP_VERSION_CHECK": "1"},
+ }
# Initiate a driver: start and connect with Ray cluster
if RAY_CLUSTER_HOST != "local":
@@ -189,35 +262,41 @@ def run_remote():
logger.debug(ClientContext)
# Get the ID of the node where the driver process is running
- driver_process_node_id = ray.get_runtime_context().get_node_id() #HEX
+ driver_process_node_id = ray.get_runtime_context().get_node_id() # HEX
logger.debug(f"ray_driver_node_id={driver_process_node_id}")
-
+
logger.debug(ray.cluster_resources())
else:
RayContext = ray.init()
logger.debug(RayContext)
- #Chunk the dataset
+ # Chunk the dataset
res = split_dataframe(df)
- logger.debug('Data Preparation started')
+ logger.debug("Data Preparation started")
start_time = time.time()
results = ray.get([get_clean_df.remote(res[i], logger, i) for i in range(len(res))])
duration = time.time() - start_time
logger.debug(f"Data Preparation finished in {duration} seconds")
- #Disconnect the worker, and terminate processes started by ray.init()
+ # Disconnect the worker, and terminate processes started by ray.init()
ray.shutdown()
- #Store the preprocessed data into GCS
+ # concat all the resulting data frames
result_df = pd.concat(results, axis=0, ignore_index=True)
- result_df.to_csv('gs://'+IMAGE_BUCKET +
- '/flipkart_preprocessed_dataset/flipkart.csv', index=False)
+ # Replace NaN with None
+ result_df = result_df.replace({np.nan: None})
+
+ # Store the preprocessed data into GCS
+ result_df.to_csv(
+ "gs://" + IMAGE_BUCKET + "/flipkart_preprocessed_dataset/flipkart.csv",
+ index=False,
+ )
return result_df
def main():
- logger.info('Started')
+ logger.info("Started")
logger.debug(f"RAY_CLUSTER_HOST={RAY_CLUSTER_HOST}")
logger.debug(f"IMAGE_BUCKET={IMAGE_BUCKET}")
@@ -225,9 +304,25 @@ def main():
clean_df = run_remote()
- logger.info('Finished')
+ logger.info("Finished")
if __name__ == "__main__":
- """ This is executed when run from the command line """
+ # Configure logging
+ logging.config.fileConfig("logging.conf")
+
+ logger = logging.getLogger("preprocessing")
+
+ if "LOG_LEVEL" in os.environ:
+ new_log_level = os.environ["LOG_LEVEL"].upper()
+ logger.info(
+ f"Log level set to '{new_log_level}' via LOG_LEVEL environment variable"
+ )
+ logging.getLogger().setLevel(new_log_level)
+ logger.setLevel(new_log_level)
+
+ logger.info("Configure signal handlers")
+ signal.signal(signal.SIGINT, graceful_shutdown)
+ signal.signal(signal.SIGTERM, graceful_shutdown)
+
main()
diff --git a/best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/requirements.txt b/best-practices/ml-platform/examples/use-case/data-processing/ray/src/requirements.txt
similarity index 100%
rename from best-practices/ml-platform/examples/use-case/ray/dataprocessing/src/requirements.txt
rename to best-practices/ml-platform/examples/use-case/data-processing/ray/src/requirements.txt
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/DEVELOPER.md b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/DEVELOPER.md
new file mode 100644
index 000000000..e8c8641c7
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/DEVELOPER.md
@@ -0,0 +1,114 @@
+# Fine Tuning PyTorch Developer Guide
+
+- Install [`pyenv`](https://github.com/pyenv/pyenv?tab=readme-ov-file#installation)
+
+- Install the `python` version
+
+ ```
+ pyenv install 3.10.14
+ ```
+
+- Clone the repository
+
+ ```
+ git clone https://github.com/GoogleCloudPlatform/ai-on-gke.git && \
+ cd ai-on-gke
+ ```
+
+- Change directory to the `src` directory
+
+ ```
+ cd best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src
+ ```
+
+- Set the local `python` version
+
+ ```
+ pyenv local 3.10.14
+ ```
+
+- Create a virtual environment
+
+ ```
+ python -m venv venv
+ ```
+
+- Activate the virtual environment
+
+ ```
+ source venv/bin/activate
+ ```
+
+- Install the requirements
+
+ ```
+ pip install --no-cache-dir -r requirements.txt
+ ```
+
+- Set environment variables
+
+ ```
+ export MLP_ENVIRONMENT_NAME=
+ export MLP_PROJECT_ID=
+ export HF_TOKEN_FILE=${HOME}/secrets/mlp-hugging-face-token
+
+ export PROJECT_ID=${MLP_PROJECT_ID}
+ gcloud config set project ${PROJECT_ID}
+ export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
+
+ export TRAINING_DATASET_BUCKET="${PROJECT_ID}-${MLP_ENVIRONMENT_NAME}-processing"
+ export MODEL_BUCKET="${PROJECT_ID}-${MLP_ENVIRONMENT_NAME}-model"
+ export CLUSTER_NAME="mlp-${MLP_ENVIRONMENT_NAME}"
+ export NAMESPACE=ml-team
+ export KSA=default
+ export HF_TOKEN=$(tr --delete '\n' <${HF_TOKEN_FILE})
+ export DOCKER_IMAGE_URL="us-docker.pkg.dev/${PROJECT_ID}/${MLP_ENVIRONMENT_NAME}-llm-finetuning/finetune:v1.0.0"
+
+ export EXPERIMENT=""
+ export MLFLOW_ENABLE="false"
+ export MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING="false"
+ export MLFLOW_TRACKING_URI=""
+ export TRAINING_DATASET_PATH="dataset/output/training"
+ export MODEL_PATH="/model-data/model-gemma2/experiment"
+ export MODEL_NAME="google/gemma-2-9b-it"
+ export ACCELERATOR="l4"
+ ```
+
+- Install `gcsfuse`
+
+ ```
+ export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s` && \
+ echo "deb [signed-by=/usr/share/keyrings/cloud.google.asc] https://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list && \
+ curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo tee /usr/share/keyrings/cloud.google.asc && \
+ sudo apt-get update && \
+ sudo apt-get install -y gcsfuse
+ ```
+
+- Create the model bucket mount directory
+
+ ```
+ export MOUNT_DIR="/model-data"
+ sudo mkdir -p ${MOUNT_DIR}
+ sudo chown ${USER} ${MOUNT_DIR}
+ ```
+
+- Mount your model bucket using `gcsfuse`
+
+ ```
+ gcsfuse ${MODEL_BUCKET} ${MOUNT_DIR}
+ ```
+
+- Run the `fine_tune.py` script
+
+ ```
+ accelerate launch \
+ --config_file fsdp_config.yaml \
+ --debug \
+ fine_tune.py
+ ```
+
+- Unmount the model bucket
+
+ ```
+ fusermount -u ${MOUNT_DIR}
+ ```
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/README.md b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/README.md
new file mode 100644
index 000000000..106a8bc83
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/README.md
@@ -0,0 +1,156 @@
+# Fine-tuning
+
+Fine-tune a Gemma Instruction Tuned model using the processed Flipkart catalog. The dataset used
+for fine-tuning is generated by the Vertex AI Gemini Flash model. The fine-tuned model can be deployed
+with an inference serving engine.
+
+## Prerequisites
+
+- This guide was developed to be run on the [playground machine learning platform](/best-practices/ml-platform/examples/platform/playground/README.md). If you are using a different environment, the scripts and manifests will need to be modified for that environment.
+- A bucket containing the processed and prepared data from the [Data Preparation example](../../data-preparation/gemma-it)
+
+## Preparation
+
+- Clone the repository and change directory to the guide directory
+
+ ```
+ git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
+ cd ai-on-gke/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch
+ ```
+
+- Ensure that your `MLP_ENVIRONMENT_FILE` is configured
+
+ ```
+ cat ${MLP_ENVIRONMENT_FILE} && \
+ source ${MLP_ENVIRONMENT_FILE}
+ ```
+
+ > You should see the various variables populated with the information specific to your environment.
+
+### Access token variables
+
+- Set `HF_TOKEN` to your HuggingFace access token. Go to https://huggingface.co/settings/tokens, click `Create new token`, provide a token name, select `Read` as the token type, and click `Create token`.
+
+ ```
+ HF_TOKEN=
+ ```
+
+## Build the container image
+
+- Build the container image using Cloud Build and push the image to Artifact Registry
+
+ ```
+ cd src
+ sed -i -e "s|^serviceAccount:.*|serviceAccount: projects/${MLP_PROJECT_ID}/serviceAccounts/${MLP_BUILD_GSA}|" cloudbuild.yaml
+ gcloud beta builds submit --config cloudbuild.yaml \
+ --project ${MLP_PROJECT_ID} \
+ --substitutions _DESTINATION=${MLP_FINE_TUNING_IMAGE}
+ cd ..
+ ```
+
+## Run the job
+
+- Get credentials for the GKE cluster
+
+ ```sh
+ gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}
+ ```
+
+- Create a Kubernetes secret with your HuggingFace token
+
+ ```sh
+ kubectl create secret generic hf-secret \
+ --from-literal=hf_api_token=${HF_TOKEN} \
+ --dry-run=client -o yaml | kubectl apply -n ${MLP_KUBERNETES_NAMESPACE} -f -
+ ```
+
+- Accept the HuggingFace license for the model
+
+ - Go to https://huggingface.co/google/gemma-2-9b-it and accept the license
+
+- Configure the job
+
+ | Variable | Description | Example |
+ | ------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
+ | ACCELERATOR | Type of GPU accelerator to use (l4, a100, h100) | l4 |
+ | DATA_BUCKET_DATASET_PATH | The path where the generated prompt data is for fine-tuning. | dataset/output/training |
+ | EXPERIMENT | If MLflow is enabled, the experiment ID used in MLflow | experiment- |
+ | HF_BASE_MODEL_NAME | The Hugging Face path to the base model for fine-tuning. | google/gemma-2-9b-it |
+ | MLFLOW_ENABLE | Enable MLflow; leaving it empty will also disable it | true/false |
+ | MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING | If MLflow is enabled, track system level metrics, CPU/Memory/GPU | true/false |
+ | MLFLOW_TRACKING_URI | If MLflow is enabled, the tracking server URI | |
+ | MODEL_PATH | The output folder path for the fine-tuned model. This location will be used by the inference serving engine and model evaluation. | /model-data/model-gemma2/experiment |
+
+ ```sh
+ ACCELERATOR="l4"
+ DATA_BUCKET_DATASET_PATH="dataset/output/training"
+ EXPERIMENT="finetune-experiment"
+ HF_BASE_MODEL_NAME="google/gemma-2-9b-it"
+ MLFLOW_ENABLE="true"
+ MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING="true"
+ MLFLOW_TRACKING_URI="http://mlflow-tracking-svc:5000"
+ MODEL_PATH="/model-data/model-gemma2/experiment"
+ ```
+
+ ```sh
+ sed \
+ -i -e "s|V_DATA_BUCKET|${MLP_DATA_BUCKET}|" \
+ -i -e "s|V_EXPERIMENT|${EXPERIMENT}|" \
+ -i -e "s|V_MODEL_NAME|${HF_BASE_MODEL_NAME}|" \
+ -i -e "s|V_IMAGE_URL|${MLP_FINE_TUNING_IMAGE}|" \
+ -i -e "s|V_KSA|${MLP_FINE_TUNING_KSA}|" \
+ -i -e "s|V_MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING|${MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING}|" \
+ -i -e "s|V_MLFLOW_ENABLE|${MLFLOW_ENABLE}|" \
+ -i -e "s|V_MLFLOW_TRACKING_URI|${MLFLOW_TRACKING_URI}|" \
+ -i -e "s|V_MODEL_BUCKET|${MLP_MODEL_BUCKET}|" \
+ -i -e "s|V_MODEL_PATH|${MODEL_PATH}|" \
+ -i -e "s|V_TRAINING_DATASET_PATH|${DATA_BUCKET_DATASET_PATH}|" \
+ manifests/fine-tune-${ACCELERATOR}-dws.yaml
+ ```
+
+- Create the provisioning request and job
+
+ ```sh
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/provisioning-request-${ACCELERATOR}.yaml
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/fine-tune-${ACCELERATOR}-dws.yaml
+ ```
+
+- Verify the completion of the job
+
+ In the Google Cloud console, go to the [Logs Explorer](https://console.cloud.google.com/logs) page to run the following query to see the completion of the job.
+
+ ```sh
+ labels."k8s-pod/app"="finetune-job"
+ textPayload: "finetune - INFO - ### Completed ###"
+ ```
+
+- After the fine-tuning job completes successfully, the model bucket should contain a checkpoint folder.
+
+ ```sh
+ gcloud storage ls gs://${MLP_MODEL_BUCKET}/${MODEL_PATH}
+ ```
+
+## Observability
+
+Besides the logs and metrics provided by Google Cloud Observability, it's also important to track the fine-tuning job and its results.
+
+There are many options for this. As an example, we use [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html) to keep track of the ML workload runs. MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for visualizing the results later.
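+
+As a generic illustration (not a copy of `fine_tune.py`), the sketch below shows roughly how a training script can log parameters and metrics to the in-cluster tracking server; the experiment name and metric values are purely illustrative:
+
+```python
+import os
+
+import mlflow
+
+# Point the client at the tracking server and pick an experiment;
+# the defaults below are illustrative fallbacks
+mlflow.set_tracking_uri(os.environ.get("MLFLOW_TRACKING_URI", "http://mlflow-tracking-svc:5000"))
+mlflow.set_experiment(os.environ.get("EXPERIMENT", "finetune-experiment"))
+
+with mlflow.start_run():
+    mlflow.log_param("model_name", "google/gemma-2-9b-it")
+    mlflow.log_param("epochs", 1)
+    mlflow.log_metric("train_loss", 0.42, step=100)  # illustrative value
+```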
+
+When you use the [playground configuration](../../../platform/playground/README.md), MLflow Tracking has already been installed for you.
+
+You can run the following command to get the URL:
+
+```sh
+echo -e "\n${MLP_KUBERNETES_NAMESPACE} MLFlow Tracking URL: ${MLP_MLFLOW_TRACKING_NAMESPACE_ENDPOINT}\n"
+```
+Read [this playground README section](../../../platform/playground/README.md#software-installed-via-reposync-and-rootsync) for more info.
+
+__Note:__ You can set the variable `MLFLOW_ENABLE` to `false` or leave it empty to disable MLflow Tracking.
+
+MLflow Tracking is protected by IAP. After you log in, you should see a page similar to the following.
+
+![mlflow-home](../../../../docs/images/mlflow-home.png)
+
+All successful experiments should appear. If you click into a completed run, you can see an overview page with metric tabs.
+
+![mlflow-model-experiment](../../../../docs/images/mlflow-model-experiment.png)
\ No newline at end of file
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-a100-dws.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-a100-dws.yaml
new file mode 100644
index 000000000..36e1197cc
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-a100-dws.yaml
@@ -0,0 +1,113 @@
+apiVersion: v1
+kind: Service
+metadata:
+ name: headless-svc-a100
+spec:
+ clusterIP: None # clusterIP must be None to create a headless service
+ selector:
+ job-name: finetune-gemma-a100 # must match Job name
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: finetune-gemma-a100
+spec:
+ backoffLimit: 0
+ completions: 1
+ parallelism: 1
+ completionMode: Indexed
+ template:
+ metadata:
+ labels:
+ app: finetune-job
+ annotations:
+ gke-gcsfuse/volumes: "true"
+ gke-gcsfuse/memory-limit: "35Gi"
+ cluster-autoscaler.kubernetes.io/consume-provisioning-request: a100-job
+ cluster-autoscaler.kubernetes.io/provisioning-class-name: "queued-provisioning.gke.io"
+ spec:
+ containers:
+ - name: gpu-job
+ imagePullPolicy: Always
+ image: V_IMAGE_URL
+ ports:
+ - containerPort: 29500
+ securityContext:
+ privileged: true
+ resources:
+ requests:
+ nvidia.com/gpu: "2"
+ limits:
+ nvidia.com/gpu: "2"
+ command:
+ - bash
+ - -c
+ - |
+ accelerate launch \
+ --config_file fsdp_config.yaml \
+ --debug \
+ --main_process_ip finetune-gemma-a100-0.headless-svc-a100 \
+ --main_process_port 29500 \
+ --machine_rank ${JOB_COMPLETION_INDEX} \
+ --num_processes 2 \
+ --num_machines 1 \
+ fine_tune.py
+ env:
+ - name: "EXPERIMENT"
+ value: "V_EXPERIMENT"
+ - name: "MLFLOW_ENABLE"
+ value: "V_MLFLOW_ENABLE"
+ - name: "MLFLOW_TRACKING_URI"
+ value: "V_MLFLOW_TRACKING_URI"
+ - name: "MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"
+ value: "V_MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"
+ - name: "TRAINING_DATASET_BUCKET"
+ value: "V_DATA_BUCKET"
+ - name: "TRAINING_DATASET_PATH"
+ value: "V_TRAINING_DATASET_PATH"
+ - name: MODEL_NAME
+ value: "V_MODEL_NAME"
+ - name: NEW_MODEL
+ value: "gemma-ft"
+ - name: MODEL_PATH
+ value: "V_MODEL_PATH"
+ - name: EPOCHS
+ value: "1"
+ - name: NCCL_DEBUG
+ value: "INFO"
+ - name: HF_TOKEN
+ valueFrom:
+ secretKeyRef:
+ name: hf-secret
+ key: hf_api_token
+ volumeMounts:
+ - mountPath: /dev/shm
+ name: dshm
+ - name: gcs-fuse-csi-ephemeral
+ mountPath: /model-data
+ readOnly: false
+ nodeSelector:
+ cloud.google.com/gke-accelerator: nvidia-tesla-a100
+ restartPolicy: OnFailure
+ serviceAccountName: V_KSA
+ subdomain: headless-svc-a100
+ terminationGracePeriodSeconds: 600
+ tolerations:
+ - key: "nvidia.com/gpu"
+ operator: "Exists"
+ effect: "NoSchedule"
+ - key: "on-demand"
+ value: "true"
+ operator: "Equal"
+ effect: "NoSchedule"
+ volumes:
+ - name: dshm
+ emptyDir:
+ medium: Memory
+ - name: gcs-fuse-csi-ephemeral
+ csi:
+ driver: gcsfuse.csi.storage.gke.io
+ volumeAttributes:
+ bucketName: V_MODEL_BUCKET
+ mountOptions: "implicit-dirs"
+ gcsfuseLoggingSeverity: warning
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-h100-dws.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-h100-dws.yaml
new file mode 100644
index 000000000..f93d7a645
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-h100-dws.yaml
@@ -0,0 +1,114 @@
+apiVersion: v1
+kind: Service
+metadata:
+ name: headless-svc-h100
+spec:
+ clusterIP: None # clusterIP must be None to create a headless service
+ selector:
+ job-name: finetune-gemma-h100 # must match Job name
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: finetune-gemma-h100
+spec:
+ backoffLimit: 0
+ completions: 1
+ parallelism: 1
+ completionMode: Indexed
+ template:
+ metadata:
+ labels:
+ app: finetune-job
+ annotations:
+ gke-gcsfuse/volumes: "true"
+ gke-gcsfuse/memory-limit: "35Gi"
+ cluster-autoscaler.kubernetes.io/consume-provisioning-request: h100-job
+ cluster-autoscaler.kubernetes.io/provisioning-class-name: "queued-provisioning.gke.io"
+ spec:
+ containers:
+ - name: gpu-job
+ imagePullPolicy: Always
+ image: V_IMAGE_URL
+ ports:
+ - containerPort: 29500
+ securityContext:
+ privileged: true
+ resources:
+ requests:
+ nvidia.com/gpu: "8"
+ limits:
+ nvidia.com/gpu: "8"
+ command:
+ - bash
+ - -c
+ - |
+ accelerate launch \
+ --config_file fsdp_config.yaml \
+ --debug \
+ --main_process_ip finetune-gemma-h100-0.headless-svc-h100 \
+ --main_process_port 29500 \
+ --machine_rank ${JOB_COMPLETION_INDEX} \
+ --num_processes 8 \
+ --num_machines 1 \
+ fine_tune.py
+ env:
+ - name: "EXPERIMENT"
+ value: "V_EXPERIMENT"
+ - name: "MLFLOW_ENABLE"
+ value: "V_MLFLOW_ENABLE"
+ - name: "MLFLOW_TRACKING_URI"
+ value: "V_MLFLOW_TRACKING_URI"
+ - name: "MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"
+ value: "V_MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"
+ - name: "TRAINING_DATASET_BUCKET"
+ value: "V_DATA_BUCKET"
+ - name: "TRAINING_DATASET_PATH"
+ value: "V_TRAINING_DATASET_PATH"
+ - name: MODEL_NAME
+ value: "V_MODEL_NAME"
+ - name: NEW_MODEL
+ value: "gemma-ft"
+ - name: MODEL_PATH
+ value: "V_MODEL_PATH"
+ - name: EPOCHS
+ value: "1"
+ - name: NCCL_DEBUG
+ value: "INFO"
+ - name: HF_TOKEN
+ valueFrom:
+ secretKeyRef:
+ name: hf-secret
+ key: hf_api_token
+ volumeMounts:
+ - mountPath: /dev/shm
+ name: dshm
+ - name: gcs-fuse-csi-ephemeral
+ mountPath: /model-data
+ readOnly: false
+ nodeSelector:
+ cloud.google.com/gke-accelerator: nvidia-h100-80gb
+ resource-model: h100
+ restartPolicy: OnFailure
+ serviceAccountName: V_KSA
+ subdomain: headless-svc-h100
+ terminationGracePeriodSeconds: 600
+ tolerations:
+ - key: "nvidia.com/gpu"
+ operator: "Exists"
+ effect: "NoSchedule"
+ - key: "on-demand"
+ value: "true"
+ operator: "Equal"
+ effect: "NoSchedule"
+ volumes:
+ - name: dshm
+ emptyDir:
+ medium: Memory
+ - name: gcs-fuse-csi-ephemeral
+ csi:
+ driver: gcsfuse.csi.storage.gke.io
+ volumeAttributes:
+ bucketName: V_MODEL_BUCKET
+ mountOptions: "implicit-dirs"
+ gcsfuseLoggingSeverity: warning
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-l4-dws.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-l4-dws.yaml
new file mode 100644
index 000000000..5e24b2dac
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/fine-tune-l4-dws.yaml
@@ -0,0 +1,111 @@
+apiVersion: v1
+kind: Service
+metadata:
+ name: headless-svc-l4
+spec:
+ clusterIP: None # clusterIP must be None to create a headless service
+ selector:
+ job-name: finetune-gemma-l4 # must match Job name
+---
+apiVersion: batch/v1
+kind: Job
+metadata:
+ name: finetune-gemma-l4
+spec:
+ backoffLimit: 0
+ completions: 2
+ parallelism: 2
+ completionMode: Indexed
+ template:
+ metadata:
+ labels:
+ app: finetune-job
+ annotations:
+ gke-gcsfuse/volumes: "true"
+ gke-gcsfuse/memory-limit: "35Gi"
+ cluster-autoscaler.kubernetes.io/consume-provisioning-request: l4-job
+ cluster-autoscaler.kubernetes.io/provisioning-class-name: "queued-provisioning.gke.io"
+ spec:
+ containers:
+ - name: gpu-job
+ imagePullPolicy: Always
+ image: V_IMAGE_URL
+ ports:
+ - containerPort: 29500
+ securityContext:
+ privileged: true
+ resources:
+ requests:
+ nvidia.com/gpu: "2"
+ limits:
+ nvidia.com/gpu: "2"
+ command:
+ - bash
+ - -c
+ - |
+ accelerate launch \
+ --config_file fsdp_config.yaml \
+ --debug \
+ --main_process_ip finetune-gemma-l4-0.headless-svc-l4 \
+ --main_process_port 29500 \
+ --machine_rank ${JOB_COMPLETION_INDEX} \
+ --num_processes 4 \
+ --num_machines 2 \
+ fine_tune.py
+ env:
+ - name: "EXPERIMENT"
+ value: "V_EXPERIMENT"
+ - name: "MLFLOW_ENABLE"
+ value: "V_MLFLOW_ENABLE"
+ - name: "MLFLOW_TRACKING_URI"
+ value: "V_MLFLOW_TRACKING_URI"
+ - name: "MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"
+ value: "V_MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING"
+ - name: "TRAINING_DATASET_BUCKET"
+ value: "V_DATA_BUCKET"
+ - name: "TRAINING_DATASET_PATH"
+ value: "V_TRAINING_DATASET_PATH"
+ - name: MODEL_NAME
+ value: "V_MODEL_NAME"
+ - name: NEW_MODEL
+ value: "gemma-ft"
+ - name: MODEL_PATH
+ value: "V_MODEL_PATH"
+ - name: EPOCHS
+ value: "1"
+ - name: HF_TOKEN
+ valueFrom:
+ secretKeyRef:
+ name: hf-secret
+ key: hf_api_token
+ volumeMounts:
+ - mountPath: /dev/shm
+ name: dshm
+ - name: gcs-fuse-csi-ephemeral
+ mountPath: /model-data
+ readOnly: false
+ nodeSelector:
+ cloud.google.com/gke-accelerator: nvidia-l4
+ restartPolicy: OnFailure
+ serviceAccountName: V_KSA
+ subdomain: headless-svc-l4
+ terminationGracePeriodSeconds: 600
+ tolerations:
+ - key: "nvidia.com/gpu"
+ operator: "Exists"
+ effect: "NoSchedule"
+ - key: "on-demand"
+ value: "true"
+ operator: "Equal"
+ effect: "NoSchedule"
+ volumes:
+ - name: dshm
+ emptyDir:
+ medium: Memory
+ - name: gcs-fuse-csi-ephemeral
+ csi:
+ driver: gcsfuse.csi.storage.gke.io
+ volumeAttributes:
+ bucketName: V_MODEL_BUCKET
+ mountOptions: "implicit-dirs"
+ gcsfuseLoggingSeverity: warning
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-a100.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-a100.yaml
new file mode 100644
index 000000000..1c280d52b
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-a100.yaml
@@ -0,0 +1,43 @@
+apiVersion: v1
+kind: PodTemplate
+metadata:
+ name: a100-job
+ namespace: ml-team
+template:
+ spec:
+ nodeSelector:
+ cloud.google.com/gke-nodepool: gpu-a100x2-a2h2-dws
+ tolerations:
+ - key: "nvidia.com/gpu"
+ operator: "Exists"
+ effect: "NoSchedule"
+ - key: "on-demand"
+ value: "true"
+ operator: "Equal"
+ effect: "NoSchedule"
+ containers:
+ - name: pi
+ image: perl
+ command: ["/bin/sh"]
+ resources:
+ limits:
+ cpu: "700m"
+ nvidia.com/gpu: 2
+ requests:
+ cpu: "700m"
+ nvidia.com/gpu: 2
+ restartPolicy: Never
+---
+apiVersion: autoscaling.x-k8s.io/v1beta1
+kind: ProvisioningRequest
+metadata:
+ name: a100-job
+ namespace: ml-team
+spec:
+ provisioningClassName: queued-provisioning.gke.io
+ parameters:
+ maxRunDurationSeconds: "86400"
+ podSets:
+ - count: 1
+ podTemplateRef:
+ name: a100-job
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-h100.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-h100.yaml
new file mode 100644
index 000000000..54a19c775
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-h100.yaml
@@ -0,0 +1,43 @@
+apiVersion: v1
+kind: PodTemplate
+metadata:
+ name: h100-job
+ namespace: ml-team
+template:
+ spec:
+ nodeSelector:
+ cloud.google.com/gke-nodepool: gpu-h100x8-a3h8-dws
+ tolerations:
+ - key: "nvidia.com/gpu"
+ operator: "Exists"
+ effect: "NoSchedule"
+ - key: "on-demand"
+ value: "true"
+ operator: "Equal"
+ effect: "NoSchedule"
+ containers:
+ - name: pi
+ image: perl
+ command: ["/bin/sh"]
+ resources:
+ limits:
+ cpu: "700m"
+ nvidia.com/gpu: 8
+ requests:
+ cpu: "700m"
+ nvidia.com/gpu: 8
+ restartPolicy: Never
+---
+apiVersion: autoscaling.x-k8s.io/v1beta1
+kind: ProvisioningRequest
+metadata:
+ name: h100-job
+ namespace: ml-team
+spec:
+ provisioningClassName: queued-provisioning.gke.io
+ parameters:
+ maxRunDurationSeconds: "86400"
+ podSets:
+ - count: 1
+ podTemplateRef:
+ name: h100-job
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-l4.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-l4.yaml
new file mode 100644
index 000000000..59870c03e
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/manifests/provisioning-request-l4.yaml
@@ -0,0 +1,43 @@
+apiVersion: v1
+kind: PodTemplate
+metadata:
+ name: l4-job
+ namespace: ml-team
+template:
+ spec:
+ nodeSelector:
+ cloud.google.com/gke-nodepool: gpu-l4x2-g2s24-dws
+ tolerations:
+ - key: "nvidia.com/gpu"
+ operator: "Exists"
+ effect: "NoSchedule"
+ - key: "on-demand"
+ value: "true"
+ operator: "Equal"
+ effect: "NoSchedule"
+ containers:
+ - name: pi
+ image: perl
+ command: ["/bin/sh"]
+ resources:
+ limits:
+ cpu: "700m"
+ nvidia.com/gpu: 2
+ requests:
+ cpu: "700m"
+ nvidia.com/gpu: 2
+ restartPolicy: Never
+---
+apiVersion: autoscaling.x-k8s.io/v1beta1
+kind: ProvisioningRequest
+metadata:
+ name: l4-job
+ namespace: ml-team
+spec:
+ provisioningClassName: queued-provisioning.gke.io
+ parameters:
+ maxRunDurationSeconds: "86400"
+ podSets:
+ - count: 2
+ podTemplateRef:
+ name: l4-job
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/.gcloudignore b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/.gcloudignore
new file mode 100644
index 000000000..098a48b46
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/.gcloudignore
@@ -0,0 +1,8 @@
+__pycache__/
+.venv/
+mlruns/
+venv/
+
+.gcloudignore
+.python-version
+cloudbuild.yaml
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/Dockerfile b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/Dockerfile
new file mode 100644
index 000000000..df5f93300
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/Dockerfile
@@ -0,0 +1,26 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM pytorch/pytorch:2.3.1-cuda12.1-cudnn8-devel
+
+RUN apt-get update && \
+ apt-get install -y --no-install-recommends git && \
+ rm -rf /var/lib/apt/lists/*
+
+COPY requirements.txt /workspace/
+
+RUN pip3 install --no-cache-dir -r requirements.txt
+RUN CUDA_HOME=/usr/local/cuda-12.1 pip3 install --no-cache-dir flash-attn --no-build-isolation
+
+COPY custom_json_formatter.py fine_tune.py fsdp_config.yaml logging.conf /workspace/
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/cloudbuild.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/cloudbuild.yaml
new file mode 100644
index 000000000..f5a8a5e4a
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/cloudbuild.yaml
@@ -0,0 +1,12 @@
+images:
+ - ${_DESTINATION}
+options:
+ logging: CLOUD_LOGGING_ONLY
+serviceAccount:
+steps:
+- name: 'gcr.io/cloud-builders/docker'
+ args:
+ - build
+ - -t
+ - ${_DESTINATION}
+ - .
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/custom_json_formatter.py b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/custom_json_formatter.py
new file mode 100644
index 000000000..0160054ec
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/custom_json_formatter.py
@@ -0,0 +1,91 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import json
+import logging
+
+
+class CustomJSONFormatter(logging.Formatter):
+ def __init__(
+ self,
+ fmt=None,
+ datefmt=None,
+ style="%",
+ validate=True,
+ defaults=None,
+ ):
+ super().__init__(
+ fmt=fmt,
+ datefmt=datefmt,
+ style=style,
+ validate=validate,
+ defaults=defaults,
+ )
+
+ self._ignore_keys = {
+ "args",
+ "levelno",
+ "msecs",
+ "msg",
+ "pathname",
+ }
+
+ self._rename_keys = {
+ "created": "timestamp",
+ "levelname": "level",
+ "relativeCreated": "runtime",
+ }
+
+ self._debug_keys = {
+ "filename",
+ "funcName",
+ "lineno",
+ "module",
+ }
+
+ # LogRecord object: https://docs.python.org/3/library/logging.html#logrecord-attributes
+ def format(self, record: logging.LogRecord) -> str:
+ entry = record.__dict__.copy()
+ entry["message"] = record.getMessage()
+
+        # Remove ignored keys
+ for key in self._ignore_keys:
+ entry.pop(key, None)
+
+ # Remove keys based on log level
+ if record.levelno > logging.DEBUG:
+ for key in self._debug_keys:
+ entry.pop(key, None)
+
+ # Rename keys
+ for key, value in list(self._rename_keys.items()):
+ if key in entry:
+ entry[value] = entry.pop(key)
+
+ if record.exc_info:
+ if not record.exc_text:
+ record.exc_text = self.formatException(record.exc_info)
+
+ if record.exc_text:
+ entry["exc_info"] = record.exc_text
+
+ if record.stack_info:
+ entry["stack_info"] = self.formatStack(record.stack_info)
+
+ # Delete keys with None value
+ for key, value in list(entry.items()):
+ if value is None:
+ del entry[key]
+
+ return json.dumps(entry)
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/fine_tune.py b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/fine_tune.py
new file mode 100644
index 000000000..ef7144496
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/fine_tune.py
@@ -0,0 +1,396 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import datasets
+import logging
+import logging.config
+import os
+import signal
+import sys
+import torch
+import transformers
+import random
+
+from accelerate import Accelerator
+from datasets import Dataset, load_dataset, load_from_disk
+from peft import LoraConfig, PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from trl import DataCollatorForCompletionOnlyLM, SFTConfig, SFTTrainer
+import torch.distributed as dist
+
+
+def graceful_shutdown(signal_number, stack_frame):
+ signal_name = signal.Signals(signal_number).name
+
+ logger.info(f"Received {signal_name}({signal_number}), shutting down...")
+    # TODO: Add logic to handle checkpointing if required
+ sys.exit(0)
+
+
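+# Appends the EOS token to every prompt and returns the result in a "prompts" column,
+# which is the column the SFTTrainer reads via dataset_text_field.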
+def formatting_prompts_func(example):
+ output_texts = []
+ for i in range(len(example["prompt"])):
+ text = f"""{example["prompt"][i]}\n{EOS_TOKEN}"""
+ output_texts.append(text)
+ return {"prompts": output_texts}
+
+
+def get_current_node_id_and_rank():
+ if dist.is_initialized():
+ logger.info("Distributed training enabled.")
+ logger.info("Calculating node id")
+ global_rank = dist.get_rank() # Get the process's rank
+ logger.info(f"global_rank: {global_rank}")
+ gpu_per_node = torch.cuda.device_count()
+ logger.info(f"gpu_per_node: {gpu_per_node}")
+ total_gpus = accelerator.state.num_processes
+ logger.info(f"total_gpus: {total_gpus}")
+ total_nodes = int(total_gpus / gpu_per_node)
+ logger.info(f"total_nodes: {total_nodes}")
+        # Each node hosts gpu_per_node processes, so dividing the global rank by gpu_per_node gives the node index
+        node_id = global_rank // gpu_per_node
+    else:
+        logger.info("Distributed training is not enabled.")
+        # Keep gpu_per_node defined for the return value and the per-node checks below
+        gpu_per_node = max(torch.cuda.device_count(), 1)
+        node_id = 0
+        global_rank = 0
+ logger.info(f"node_id: {node_id}")
+ return (node_id, global_rank, gpu_per_node)
+
+
+if __name__ == "__main__":
+ # Configure logging
+ logging.config.fileConfig("logging.conf")
+
+ logger = logging.getLogger("finetune")
+
+ if "LOG_LEVEL" in os.environ:
+ new_log_level = os.environ["LOG_LEVEL"].upper()
+ logger.info(
+ f"Log level set to '{new_log_level}' via LOG_LEVEL environment variable"
+ )
+ logging.getLogger().setLevel(new_log_level)
+ logger.setLevel(new_log_level)
+
+ datasets.disable_progress_bar()
+ transformers.utils.logging.disable_progress_bar()
+
+ logger.info("Configure signal handlers")
+ signal.signal(signal.SIGINT, graceful_shutdown)
+ signal.signal(signal.SIGTERM, graceful_shutdown)
+
+ accelerator = Accelerator()
+
+ if "MLFLOW_ENABLE" in os.environ and os.environ["MLFLOW_ENABLE"] == "true":
+ import mlflow
+
+ remote_server_uri = os.environ["MLFLOW_TRACKING_URI"]
+ mlflow.set_tracking_uri(remote_server_uri)
+
+ experiment_name = os.environ["EXPERIMENT"]
+ if accelerator.is_main_process: # Only the main process sets the experiment
+ # Check if the experiment already exists
+ experiment = mlflow.get_experiment_by_name(experiment_name)
+ if experiment is None:
+ # Experiment doesn't exist, create it
+ try:
+ experiment_id = mlflow.create_experiment(experiment_name)
+ experiment = mlflow.get_experiment(experiment_id)
+                    logger.info(
+                        f"Created new experiment: {experiment.name} (ID: {experiment.experiment_id})"
+                    )
+ except Exception as ex:
+ logger.error(f"Create experiment failed: {ex}")
+ else:
+ # Experiment already exists, use it
+ logger.info(
+ f"Using existing experiment: {experiment.name} (ID: {experiment.experiment_id})"
+ )
+
+ # Barrier to ensure all processes wait until the experiment is created
+ accelerator.wait_for_everyone()
+ experiment = mlflow.get_experiment_by_name(experiment_name)
+
+ # Get the node ID
+ node, rank, gpu_per_node = get_current_node_id_and_rank()
+ node_id = "node_" + str(node)
+ logger.info(f"Training at: {node_id} process: {rank}")
+ process_id = "process_" + str(rank)
+
+ # Set MLflow experiment and node ID (within the appropriate run context)
+ mlflow.set_experiment(experiment_name)
+ mlflow.set_system_metrics_node_id(node_id)
+
+ # Check and create/reuse runs
+ # Only one process per node creates a run to capture system metrics
+ if (rank % gpu_per_node) == 0:
+ existing_run = mlflow.search_runs(
+ experiment_ids=[experiment.experiment_id],
+ filter_string=f"tags.mlflow.runName = '{process_id}'",
+ )
+ # No existing run with this name
+ if len(existing_run) == 0:
+ run = mlflow.start_run(
+ run_name=node_id,
+ experiment_id=experiment.experiment_id,
+ )
+ has_active_ml_flow_run = True
+ else:
+ logger.info(f"Run with name '{process_id}' already exists")
+ client = mlflow.MlflowClient()
+ data = client.get_run(mlflow.active_run().info.run_id).data
+ logger.info(f"Active run details: '{data}'")
+
+ # logging of model parameters, metrics, and artifacts
+ mlflow.autolog()
+
+ # The bucket which contains the training data
+ training_data_bucket = os.environ["TRAINING_DATASET_BUCKET"]
+
+ training_data_path = os.environ["TRAINING_DATASET_PATH"]
+
+ # The model that you want to train from the Hugging Face hub
+ model_name = os.environ["MODEL_NAME"]
+
+ # Fine-tuned model name
+ new_model = os.environ["NEW_MODEL"]
+
+ # The root path of where the fine-tuned model will be saved
+ save_model_path = os.environ["MODEL_PATH"]
+
+ # Load tokenizer
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+ tokenizer.pad_token = tokenizer.eos_token
+ tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
+
+ EOS_TOKEN = tokenizer.eos_token
+
+ training_dataset = load_from_disk(
+ f"gs://{training_data_bucket}/{training_data_path}"
+ )
+
+ logger.info("Data Formatting Started")
+ input_data = training_dataset.map(formatting_prompts_func, batched=True)
+ logger.info("Data Formatting Completed")
+
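+    # "model" marks the start of the assistant turn in the Gemma chat template; the
+    # completion-only collator masks everything up to it so loss is computed only on the response.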
+ INPUT_OUTPUT_DELIMITER = "model"
+
+ collator = DataCollatorForCompletionOnlyLM(
+ INPUT_OUTPUT_DELIMITER, tokenizer=tokenizer
+ )
+
+ ################################################################################
+ # QLoRA parameters
+ ################################################################################
+
+ # LoRA attention dimension
+ lora_r = int(os.getenv("LORA_R", "8"))
+
+ # Alpha parameter for LoRA scaling
+ lora_alpha = int(os.getenv("LORA_ALPHA", "16"))
+
+ # Dropout probability for LoRA layers
+ lora_dropout = float(os.getenv("LORA_DROPOUT", "0.1"))
+
+ ################################################################################
+ # TrainingArguments parameters
+ ################################################################################
+
+ # Output directory where the model predictions and checkpoints will be stored
+ # output_dir = "./results"
+
+ # Number of training epochs
+ num_train_epochs = int(os.getenv("EPOCHS", "1"))
+
+ # Enable fp16/bf16 training (set bf16 to True with an A100)
+ fp16 = False
+ bf16 = False
+
+ # Batch size per GPU for training
+ per_device_train_batch_size = int(os.getenv("TRAIN_BATCH_SIZE", "1"))
+
+ # Batch size per GPU for evaluation
+ per_device_eval_batch_size = 1
+
+ # Number of update steps to accumulate the gradients for
+ gradient_accumulation_steps = int(os.getenv("GRADIENT_ACCUMULATION_STEPS", "1"))
+
+ # Enable gradient checkpointing
+ gradient_checkpointing = True
+
+    # Maximum gradient norm (gradient clipping)
+ max_grad_norm = float(os.getenv("MAX_GRAD_NORM", "0.3"))
+
+ # Initial learning rate (AdamW optimizer)
+ learning_rate = float(os.getenv("LEARNING_RATE", "2e-4"))
+
+ # Weight decay to apply to all layers except bias/LayerNorm weights
+ weight_decay = float(os.getenv("WEIGHT_DECAY", "0.001"))
+
+ # Optimizer to use
+ optim = "paged_adamw_32bit"
+
+ # Learning rate schedule
+ lr_scheduler_type = "cosine"
+
+ # Number of training steps (overrides num_train_epochs)
+ max_steps = -1
+
+ # Ratio of steps for a linear warmup (from 0 to learning rate)
+ warmup_ratio = float(os.getenv("WARMUP_RATIO", "0.03"))
+
+ # Group sequences into batches with same length
+ # Saves memory and speeds up training considerably
+ group_by_length = True
+
+ # Save checkpoint every X updates steps
+ save_steps = 0
+
+ # Log every X updates steps
+ logging_steps = 50
+
+ ################################################################################
+ # SFT parameters
+ ################################################################################
+
+ # Maximum sequence length to use
+ max_seq_length = int(os.getenv("MAX_SEQ_LENGTH", "512"))
+
+ # Pack multiple short examples in the same input sequence to increase efficiency
+ packing = False
+
+ # Load base model
+ logger.info("Loading base model started")
+ model = AutoModelForCausalLM.from_pretrained(
+ attn_implementation="eager",
+ pretrained_model_name_or_path=model_name,
+ torch_dtype=torch.bfloat16,
+ )
+ model.config.use_cache = False
+ model.config.pretraining_tp = 1
+ logger.info("Loading base model completed")
+
+ logger.info("Configuring fine tuning started")
+ # Load LoRA configuration
+ peft_config = LoraConfig(
+ lora_alpha=lora_alpha,
+ lora_dropout=lora_dropout,
+ r=lora_r,
+ bias="none",
+ task_type="CAUSAL_LM",
+ target_modules=[
+ "q_proj",
+ "k_proj",
+ "v_proj",
+ "o_proj",
+ "gate_proj",
+ "up_proj",
+ "down_proj",
+ ],
+ )
+
+ # Set training parameters
+ training_arguments = SFTConfig(
+ bf16=bf16,
+ dataset_kwargs={
+ "add_special_tokens": False, # We template with special tokens
+ "append_concat_token": False, # No need to add additional separator token
+ },
+ dataset_text_field="prompts",
+ disable_tqdm=True,
+ fp16=fp16,
+ gradient_accumulation_steps=gradient_accumulation_steps,
+ gradient_checkpointing=gradient_checkpointing,
+ gradient_checkpointing_kwargs={"use_reentrant": False},
+ group_by_length=group_by_length,
+ log_on_each_node=False,
+ logging_steps=logging_steps,
+ learning_rate=learning_rate,
+ lr_scheduler_type=lr_scheduler_type,
+ max_grad_norm=max_grad_norm,
+ max_seq_length=max_seq_length,
+ max_steps=max_steps,
+ num_train_epochs=num_train_epochs,
+ optim=optim,
+ output_dir=save_model_path,
+ packing=packing,
+ per_device_train_batch_size=per_device_train_batch_size,
+ save_steps=save_steps,
+ warmup_ratio=warmup_ratio,
+ weight_decay=weight_decay,
+ )
+ logger.info("Configuring fine tuning completed")
+
+ logger.info("Creating trainer started")
+ trainer = SFTTrainer(
+ args=training_arguments,
+ data_collator=collator,
+ model=model,
+ peft_config=peft_config,
+ tokenizer=tokenizer,
+ train_dataset=input_data,
+ )
+ logger.info("Creating trainer completed")
+
+ logger.info("Fine tuning started")
+ trainer.train()
+ logger.info("Fine tuning completed")
+
+ if "MLFLOW_ENABLE" in os.environ and os.environ["MLFLOW_ENABLE"] == "true":
+ if accelerator.is_main_process: # register the model only at main process
+ mv = mlflow.register_model(
+ model_uri=f"gs://{training_data_bucket}/{save_model_path}",
+ name=new_model,
+ )
+ logger.info(f"Name: {mv.name}")
+ logger.info(f"Version: {mv.version}")
+
+ logger.info("Saving new model started")
+ trainer.model.save_pretrained(new_model)
+ logger.info("Saving new model completed")
+
+ logger.info("Merging the new model with base model started")
+ # Reload model in FP16 and merge it with LoRA weights
+ base_model = AutoModelForCausalLM.from_pretrained(
+ low_cpu_mem_usage=True,
+ pretrained_model_name_or_path=model_name,
+ return_dict=True,
+ torch_dtype=torch.bfloat16,
+ )
+ model = PeftModel.from_pretrained(
+ model=base_model,
+ model_id=new_model,
+ )
+ model = model.merge_and_unload()
+ logger.info("Merging the new model with base model completed")
+
+ logger.info("Accelerate unwrap model started")
+ unwrapped_model = accelerator.unwrap_model(model)
+ logger.info("Accelerate unwrap model completed")
+
+ logger.info("Save unwrapped model started")
+ unwrapped_model.save_pretrained(
+ is_main_process=accelerator.is_main_process,
+ save_directory=save_model_path,
+ save_function=accelerator.save,
+ )
+ logger.info("Save unwrapped model completed")
+
+ logger.info("Save new tokenizer started")
+ if accelerator.is_main_process:
+ tokenizer.save_pretrained(save_model_path)
+ logger.info("Save new tokenizer completed")
+
+ # Barrier to ensure all processes wait until 'unwrapped model save' is completed
+ accelerator.wait_for_everyone()
+
+ logger.info("Script completed")
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/fsdp_config.yaml b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/fsdp_config.yaml
new file mode 100644
index 000000000..549e3e34d
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/fsdp_config.yaml
@@ -0,0 +1,41 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
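+# Hugging Face Accelerate configuration for FSDP (Fully Sharded Data Parallel) fine-tuning.
+# num_machines and num_processes describe the launch topology used by `accelerate launch`.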
+compute_environment: LOCAL_MACHINE
+debug: true
+distributed_type: FSDP
+downcast_bf16: 'no'
+enable_cpu_affinity: false
+fsdp_config:
+ fsdp_activation_checkpointing: false
+ fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
+ fsdp_backward_prefetch: BACKWARD_PRE
+ fsdp_cpu_ram_efficient_loading: false
+ fsdp_forward_prefetch: false
+ fsdp_offload_params: false
+ fsdp_sharding_strategy: FULL_SHARD
+ fsdp_state_dict_type: SHARDED_STATE_DICT
+ fsdp_sync_module_states: false
+ fsdp_use_orig_params: true
+machine_rank: 0
+main_training_function: main
+mixed_precision: 'no'
+num_machines: 1
+num_processes: 2
+rdzv_backend: static
+same_network: true
+tpu_env: []
+tpu_use_cluster: false
+tpu_use_sudo: false
+use_cpu: false
\ No newline at end of file
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/logging.conf b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/logging.conf
new file mode 100644
index 000000000..cf7d576ed
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/logging.conf
@@ -0,0 +1,24 @@
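+# Python logging configuration loaded with logging.config.fileConfig() in fine_tune.py.
+# All records are written to stdout as single-line JSON via CustomJSONFormatter.
+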
+[formatters]
+keys=custom_json
+
+[formatter_custom_json]
+class=custom_json_formatter.CustomJSONFormatter
+
+[handlers]
+keys=console
+
+[handler_console]
+class=logging.StreamHandler
+args=(sys.stdout,)
+formatter=custom_json
+
+[loggers]
+keys=root,finetune
+
+[logger_root]
+level=INFO
+handlers=console
+
+[logger_finetune]
+level=INFO
+handlers=console
+qualname=finetune
+propagate=0
diff --git a/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/requirements.txt b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/requirements.txt
new file mode 100644
index 000000000..5b82e115a
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/fine-tuning/pytorch/src/requirements.txt
@@ -0,0 +1,12 @@
+accelerate==v0.31.0
+bitsandbytes==0.43.1
+datasets==2.19.2
+einops==0.8.0
+gcsfs==2024.3.1
+mlflow==2.14.1
+peft==v0.11.1
+psutil==6.0.0
+pynvml==11.5.0
+torch==2.3.1
+transformers==v4.42.3
+trl==v0.9.4
diff --git a/best-practices/ml-platform/examples/use-case/model-eval/README.md b/best-practices/ml-platform/examples/use-case/model-eval/README.md
new file mode 100644
index 000000000..137c7df28
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/model-eval/README.md
@@ -0,0 +1,125 @@
+# Model evaluation and validation
+
+Once a model has completed fine-tuning, it must be validated for precision and accuracy
+against the dataset used to fine-tune it. In this example, the model is deployed on an
+inference serving engine so that the model validation can take place. Two steps are performed
+for this activity: the first sends prompts to the fine-tuned model, and the second validates the results.
+
+## Prerequisites
+
+- This guide was developed to be run on the [playground machine learning platform](/best-practices/ml-platform/examples/platform/playground/README.md). If you are using a different environment the scripts and manifest will need to be modified for that environment.
+- A bucket containing the model weights from the [Fine tuning example](../../fine-tuning/pytorch)
+
+## Preparation
+
+- Clone the repository and change directory to the guide directory
+
+  ```sh
+ git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
+ cd ai-on-gke/best-practices/ml-platform/examples/use-case/model-eval
+ ```
+
+- Ensure that your `MLP_ENVIRONMENT_FILE` is configured
+
+  ```sh
+ cat ${MLP_ENVIRONMENT_FILE} && \
+ source ${MLP_ENVIRONMENT_FILE}
+ ```
+
+ > You should see the various variables populated with the information specific to your environment.
+
+## Build the container image
+
+- Build container image using Cloud Build and push the image to Artifact Registry
+
+  ```sh
+ cd src
+ sed -i -e "s|^serviceAccount:.*|serviceAccount: projects/${MLP_PROJECT_ID}/serviceAccounts/${MLP_BUILD_GSA}|" cloudbuild.yaml
+ gcloud beta builds submit --config cloudbuild.yaml \
+ --project ${MLP_PROJECT_ID} \
+ --substitutions _DESTINATION=${MLP_MODEL_EVALUATION_IMAGE}
+ cd ..
+ ```
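+
+- Verify the container image was pushed to Artifact Registry (optional). The command below assumes `MLP_MODEL_EVALUATION_IMAGE` is a fully qualified Artifact Registry image reference, including a tag:
+
+  ```sh
+  gcloud artifacts docker images describe ${MLP_MODEL_EVALUATION_IMAGE} \
+    --project ${MLP_PROJECT_ID}
+  ```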
+
+## Run the job
+
+- Get credentials for the GKE cluster
+
+ ```sh
+ gcloud container fleet memberships get-credentials ${MLP_CLUSTER_NAME} --project ${MLP_PROJECT_ID}
+ ```
+
+- Configure the deployment
+
+ | Variable | Description | Example |
+ | -------------- | ----------------------------------------------- | ---------------------------------------- |
+ | ACCELERATOR | Type of GPU accelerator to use (l4, a100) | l4 |
+ | VLLM_IMAGE_URL | The image url for the vllm image | vllm/vllm-openai:v0.5.3.post1 |
+ | MODEL | The output folder path for the fine-tuned model | /model-data/model-gemma2-a100/experiment |
+
+ ```sh
+ ACCELERATOR="l4"
+ VLLM_IMAGE_URL="vllm/vllm-openai:v0.5.3.post1"
+ MODEL="/model-data/model-gemma2/experiment"
+ ```
+
+ ```sh
+ sed \
+ -i -e "s|V_IMAGE_URL|${VLLM_IMAGE_URL}|" \
+ -i -e "s|V_KSA|${MLP_MODEL_EVALUATION_KSA}|" \
+ -i -e "s|V_BUCKET|${MLP_MODEL_BUCKET}|" \
+ -i -e "s|V_MODEL_PATH|${MODEL}|" \
+ manifests/deployment-${ACCELERATOR}.yaml
+ ```
+
+- Create the deployment
+
+ ```sh
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/deployment-${ACCELERATOR}.yaml
+ ```
+
+- Wait for the deployment to be ready
+
+  ```sh
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} wait --for=condition=ready --timeout=900s pod -l app=vllm-openai
+ ```
+
+  When the deployment is ready, you should see output similar to:
+
+ ```output
+ pod/vllm-openai-XXXXXXXXXX-XXXXX condition met
+ ```
+
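+- Optionally, verify that the inference server is responding before running the job. This sketch assumes the deployment exposes a Kubernetes Service named `vllm-openai-${ACCELERATOR}` on port 8000, matching the `ENDPOINT` used in the next step:
+
+  ```sh
+  kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} port-forward service/vllm-openai-${ACCELERATOR} 8000:8000 &
+  sleep 5
+  curl http://127.0.0.1:8000/v1/models
+  ```
+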
+- Configure the job
+
+ | Variable | Description | Example |
+ | ------------------- | --------------------------------------------------------------------------------------------------------- | ---------------------------------------------- |
+ | DATASET_OUTPUT_PATH | The folder path of the generated output data set. | dataset/output |
+ | ENDPOINT | This is the endpoint URL of the inference server | http://vllm-openai-l4:8000/v1/chat/completions |
+ | MODEL_PATH | The output folder path for the fine-tuned model. This is used by model evaluation to generate the prompt. | /model-data/model-gemma2/experiment |
+ | PREDICTIONS_FILE | The predictions file | predictions.txt |
+
+ ```sh
+ DATASET_OUTPUT_PATH="dataset/output"
+ ENDPOINT="http://vllm-openai-${ACCELERATOR}:8000/v1/chat/completions"
+ MODEL_PATH="/model-data/model-gemma2/experiment"
+ PREDICTIONS_FILE="predictions.txt"
+ ```
+
+ ```sh
+ sed \
+ -i -e "s|V_DATA_BUCKET|${MLP_DATA_BUCKET}|" \
+ -i -e "s|V_DATASET_OUTPUT_PATH|${DATASET_OUTPUT_PATH}|" \
+ -i -e "s|V_ENDPOINT|${ENDPOINT}|" \
+ -i -e "s|V_IMAGE_URL|${MLP_MODEL_EVALUATION_IMAGE}|" \
+ -i -e "s|V_KSA|${MLP_MODEL_EVALUATION_KSA}|" \
+ -i -e "s|V_MODEL_PATH|${MODEL_PATH}|" \
+ -i -e "s|V_PREDICTIONS_FILE|${PREDICTIONS_FILE}|" \
+ manifests/job.yaml
+ ```
+
+- Create the job
+
+ ```sh
+ kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} apply -f manifests/job.yaml
+ ```
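+
+- Watch the job until it completes. The job name is defined in `manifests/job.yaml`; replace `<JOB_NAME>` below with the name used in that manifest:
+
+  ```sh
+  kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} get jobs
+  kubectl --namespace ${MLP_KUBERNETES_NAMESPACE} logs --follow job/<JOB_NAME>
+  ```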
diff --git a/best-practices/ml-platform/examples/use-case/model-eval/examples/gemma11_job1.txt b/best-practices/ml-platform/examples/use-case/model-eval/examples/gemma11_job1.txt
new file mode 100644
index 000000000..c36de824e
--- /dev/null
+++ b/best-practices/ml-platform/examples/use-case/model-eval/examples/gemma11_job1.txt
@@ -0,0 +1,17145 @@
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a casual, comfortable bra for lounging around the house. Do you have any recommendations?model
+Output:
+The Grafion by Grafion - Comfort Feel Women's Full Coverage Bra is a great option! It's wire-free, made of cotton, and comes in a solid color.
+ Product Name: Grafion by Grafion - Comfort Feel Women's Full Coverage Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: White
+- color: White
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Beach Wear, Lounge Wear, Casual
+- Wire Support: Wirefree
+- Closure: Hook
+- Straps: Regular
+- Number of Contents in Sales Package: Pack of 1
+- Cup Type: Non Padded
+- Fabric: Cotton
+- Seam Type: Seamed
+- Type: Full Coverage Bra
+- Series: Fashion
+- Neck: V Neck
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I want a printed shirt for everyday wear, any suggestions?model
+Output:
+The Tokyo Talkies Women's Printed Casual Shirt is a great choice, it's comfortable and stylish, perfect for everyday wear.
+ Product Name: Tokyo Talkies Women's Printed Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Printed
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Fabric: 100% Polyester
+- Collar: Mandarin Collar
+- Fit: Regular
+- Style Code: Tttp000630 Green
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a casual, printed bra that's comfortable, any suggestions?model
+Output:
+The S4S Printed Women's Full Coverage Bra is a good choice, it's made of cotton and comes in a pack of two.
+ Product Name: S4S Printed Women's Full Coverage Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Blue, Red
+- color: Blue, Red
+- Pattern: Printed
+- Ideal For: Women's
+- Occasion: Casual
+- Wire Support: Wirefree
+- Inner Lining: Cotton Lining
+- Straps: Regular
+- Detachable Straps: No
+- Number of Contents in Sales Package: Pack of 2
+- Cup Type: Non Padded
+- Fabric: Cotton
+- Type: Full Coverage Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: What's the fabric of the dupatta in the Parisha Cotton Self Design, Printed Salwar Suit Dupatta Material?model
+Output:
+The dupatta is made of cotton.
+ Product Name: Parisha Cotton Self Design, Printed Salwar Suit Dupatta Material
+Product Category: Women's Clothing
+Product Details:
+- Dupatta Fabric: Cotton
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Top Fabric: Cotton
+- Type: Salwar Suit Dupatta Material
+- Bottom Fabric: Cotton
+- Top Length: 40 inch
+- Top Width: 70 inch
+- Pattern: Self Design, Printed
+- Ideal For: Women's
+- Occasion: Formal, Casual
+- Color: Pink, Multicolor
+- Style Code: PVSCNZ1011
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: How many t-shirts are included in the Candy House Solid Men's Polo Neck T-shirt pack?model
+Output:
+It's a pack of 4.
+ Product Name: Candy House Solid Men's Polo Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 4
+- Fabric: POLYCOTTON
+- Type: Polo Neck
+- Pockets: NO
+- Fit: Regular
+- Pattern: Solid
+- Ideal For: Men's
+- Occasion: Casual
+- Style Code: POLO4_BLK_RBLU_YLO_GRY
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Is there a stylish, solid casual shirt for women that's made from viscose?model
+Output:
+The Ishin Women's Solid Casual Shirt is a stylish and versatile option, it's made from viscose and can be dressed up or down depending on the occasion.
+ Product Name: Ishin Women's Solid Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Slim
+- Fabric: Viscose
+- Fit: Slim
+- Style Code: INDWT-5112
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Are there any camisoles that come in a pack of two?model
+Output:
+Miss Clyra Women's Camisole comes in a pack of two and is made of cotton.
+ Product Name: Miss Clyra Women's Camisole
+Product Category: Women's Clothing
+Product Details:
+- Number of Contents in Sales Package: Pack of 2
+- Fabric: Cotton
+- Type: Camisole
+- Pattern: Solid
+- Ideal For: Women's
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a casual, half-sleeve graphic print t-shirt for men, any recommendations?model
+Output:
+You might like the Ocean Race Graphic Print Men's Round Neck T-Shirt, it's made of 100% cotton and has a regular fit.
+ Product Name: Ocean Race Graphic Print Men's Round Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Regular
+- Pattern: Graphic Print
+- Occasion: Casual
+- Ideal For: Men's
+- Style Code: OCR-1040
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a comfy, basic camisole for everyday wear. Is this LIENZ camisole made of cotton?model
+Output:
+Yes, it's made of cotton.
+ Product Name: LIENZ Women's Camisole
+Product Category: Women's Clothing
+Product Details:
+- Number of Contents in Sales Package: Pack of 2
+- Fabric: Cotton
+- Type: Camisole
+- Pattern: Solid
+- Ideal For: Women's
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: What's a good tube bra for parties or sports that comes in a pack of two?model
+Output:
+The Our Rituals CDMR28 Women's Tube Bra is a great option, it's a pack of two and comes in a vibrant red and magenta color combination.
+ Product Name: Our Rituals CDMR28 Women's Tube Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Majenta, Red
+- color: Red, Red
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Party, Sports
+- Wire Support: Wirefree
+- Inner Lining: Cotton Lining
+- Straps: Strapless
+- Detachable Straps: No
+- Number of Contents in Sales Package: Pack of 2
+- Cup Type: Non Padded
+- Fabric: Lycra, Cotton
+- Type: Tube Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a grey t-shirt with 3/4 sleeves. Does the Candies by Pantaloons Solid Women's Round Neck T-Shirt come in grey?model
+Output:
+Yes, the Candies by Pantaloons Solid Women's Round Neck T-Shirt is available in grey.
+ Product Name: Candies by Pantaloons Solid Women's Round Neck T-Shirt
+Product Category: Women's Clothing
+Product Details:
+- Sleeve: 3/4 Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Regular
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Casual
+- Style Code: 110001760grey Melange
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: How many t-shirts are included in the Candy House Solid Men's Polo Neck T-shirt pack?model
+Output:
+It's a pack of 3 t-shirts.
+ Product Name: Candy House Solid Men's Polo Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 3
+- Fabric: POLYCOTTON
+- Type: Polo Neck
+- Pockets: NO
+- Fit: Regular
+- Pattern: Solid
+- Ideal For: Men's
+- Occasion: Casual
+- Style Code: POLO3RN_GRN_GRY
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a stylish shirt for a party, something with a bit of flair. Any suggestions?model
+Output:
+The Meira Women's Solid Party Shirt is a great choice! It's a slim fit, has a Nehru collar and is made of cotton.
+ Product Name: Meira Women's Solid Party Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Occasion: Party
+- Ideal For: Women's
+- Closure: Button
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Slim
+- Collar: Nehru Collar
+- Fabric: Cotton
+- Fit: Slim
+- Style Code: MEWT-1156-D-Multi
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a pink bra for a special occasion. Do you have any recommendations?model
+Output:
+The Urbaano Women's Full Coverage Bra in pink is a great choice for a special occasion. It's a wire-free, non-padded bra with a polka dot pattern and pink lace accents. It's available online at Flipkart.com.
+ Product Name: Urbaano Women's Full Coverage Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Pink
+- color: Pink
+- Pattern: Polka Print
+- Ideal For: Women's
+- Occasion: Wedding, Party, Formal
+- Wire Support: Wirefree
+- Straps: Regular
+- Number of Contents in Sales Package: Pack of 1
+- Cup Type: Molded Cups
+- Fabric: Cotton Lycra
+- Type: Full Coverage Bra
+- Seam Type: Seamed
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a comfortable, embroidered shirt for everyday wear. Any suggestions?model
+Output:
+The Tokyo Talkies Women's Embroidered Casual Shirt is made of 100% cotton and has a mandarin collar, making it a comfortable and stylish option for everyday wear.
+ Product Name: Tokyo Talkies Women's Embroidered Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Embroidered
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Fabric: 100% Cotton
+- Collar: Mandarin Collar
+- Type: Shirt
+- Fit: Regular
+- Style Code: TTSH000057BAKED CLAY
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a short, casual shrug for women with a unique woven design. What do you recommend?model
+Output:
+The Vanca Casual Short Sleeve Solid Women's Shrug might be perfect for you! It's a solid color shrug with a woven design and a shawl collar.
+ Product Name: The Vanca Casual Short Sleeve Solid Women's Shrug
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+- Closure: No Closure
+- Sleeve: Short Sleeve
+- Collar: Shawl Collar
+- Fabric: Polyester Spandex
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for comfortable men's track pants with a cool print. What are some options?model
+Output:
+TAMBAS Printed Men's Track Pants are a great choice, they're made of 100% cotton and come in a variety of prints.
+ Product Name: TAMBAS Printed Men's Track Pants
+Product Category: Men's Clothing
+Product Details:
+- Closure: Thread and Elastane
+- Fabric: 100% Cotton
+- Pockets: 2 Kangaroo Pocket At Front and 1 Pocket backside
+- Pattern: Printed
+- Ideal For: Men's
+- Style Code: Basiden DM SS16
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a full-length skirt with a tie closure, do you have any recommendations?model
+Output:
+The American Swan Solid Women's Wrap Around Skirt is a full-length, solid-colored skirt with a tie closure, it's made of 100% cotton and has an elastic waistband.
+ Product Name: American Swan Solid Women's Wrap Around Skirt
+Product Category: Women's Clothing
+Product Details:
+- Closure: Tie
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: 100% Cotton
+- Type: Wrap Around
+- Waistband: Elastic
+- Belt Loops: No
+- Belt: No
+- Length: Full Length
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a bra that I can wear under a variety of outfits, from casual to formal. Does the UR Image by Ur Image - Fashion Women's Full Coverage Bra work for different occasions?model
+Output:
+Absolutely! This bra is designed for everyday wear and can be styled with various outfits, from casual beach wear to formal events.
+ Product Name: UR Image by Ur Image - Fashion Women's Full Coverage Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Multicolor
+- Age Group: NA - NA month
+- color: Multicolor
+- Pattern: Self Design, Solid
+- Ideal For: Women's
+- Occasion: Beach Wear, Casual, Festive, Formal, Lounge Wear, Party, Sports, Wedding
+- Wire Support: Wirefree
+- Straps: Regular
+- Detachable Straps: No
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Seam Type: Seamless
+- Type: Full Coverage Bra
+- Series: Women's
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a pack of two t-shirts made of cotton, in a slim fit. Any suggestions?model
+Output:
+The Go India Store Solid Women's Round Neck T-Shirt comes in a pack of two, is made of cotton, and has a slim fit.
+ Product Name: Go India Store Solid Women's Round Neck T-Shirt
+Product Category: Women's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 2
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Slim
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Casual
+- Style Code: TSHW00001450BL-R
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a stylish and casual printed shirt for women, what's a good option?model
+Output:
+The Tokyo Talkies Women's Printed Casual Shirt is a great choice, it's made of 100% polyester and comes in a variety of prints.
+ Product Name: Tokyo Talkies Women's Printed Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Printed
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Collar: Mandarin Collar
+- Fabric: 100% Polyester
+- Type: Shirt
+- Fit: Regular
+- Style Code: TTSH000025DARK GREY
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a black bra that offers full coverage and is non-padded. Do you have any recommendations?model
+Output:
+The Vaishna Fashion Women's Full Coverage Bra is a non-padded, full coverage bra that comes in black.
+ Product Name: Vaishna Fashion Women's Full Coverage Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Black
+- color: Black
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+- Wire Support: Underwire
+- Straps: Regular
+- Number of Contents in Sales Package: Pack of 2
+- Cup Type: Non Padded Cups
+- Fabric: Cotton
+- Type: Full Coverage Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a nightgown that's both comfy and stylish, what do you recommend?model
+Output:
+The Simrit Women's Nighty is a great option, it's made of cotton and comes in a printed pattern.
+ Product Name: Simrit Women's Nighty
+Product Category: Women's Clothing
+Product Details:
+- Number of Contents in Sales Package: 2
+- Fabric: Cotton
+- Type: Nighty
+- Pattern: Printed
+- Ideal For: Women's
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: Are these briefs designed for men?model
+Output:
+Yes, the Selfcare Men's Brief is designed for men.
+ Product Name: Selfcare Men's Brief
+Product Category: Men's Clothing
+Product Details:
+- Number of Contents in Sales Package: Pack of 5
+- Fabric: Cotton
+- Type: Hip Brief
+- Waistband: Elastic Waistband
+- Ideal For: Men's
+- Style Code: 645
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I need a casual sweater for men, but I don't want anything with a hood. Any ideas?model
+Output:
+The Club York Solid V-neck Casual Men's Sweater is a great option, it's a full sleeve, solid color sweater with a V-neck and is perfect for casual wear.
+ Product Name: Club York Solid V-neck Casual Men's Sweater
+Product Category: Men's Clothing
+Product Details:
+- Hooded: No
+- Closure: N/A
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Reversible: No
+- Fabric: 100% Cotton
+- Neck: V-neck
+- Pattern: Solid
+- Ideal For: Men's
+- Occasion: Casual
+- Other Details: Best Comfortable Sweater
+- Style Code: CY51
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a comfortable, casual V-neck t-shirt for men. What's a good option?model
+Output:
+The Numero Uno Solid Men's V-neck T-Shirt is a great choice, it's made of cotton and has a slim fit.
+ Product Name: Numero Uno Solid Men's V-neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: V-neck
+- Fit: Slim Fit
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Men's
+- Style Code: NMFNHZ441_Grey..F
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a comfortable and stylish red top for work, any suggestions?model
+Output:
+The Imagination Casual Full Sleeve Solid Women's Red Top is a great option, it's made of acrylic and has a round neck and button closure.
+ Product Name: Imagination Casual Full Sleeve Solid Women's Red Top
+Product Category: Women's Clothing
+Product Details:
+- Ideal For: Women's
+- Occasion: Casual
+- Pattern: Solid
+- Type: Top
+- Fabric: Acrylic
+- Neck: Round Neck
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: What's a good, affordable sweater for men that's not hooded?model
+Output:
+The Reveller Striped V-neck Casual Men's Sweater is a great option! It's priced at Rs. 1,199 and is made of cotton.
+ Product Name: Reveller Striped V-neck Casual Men's Sweater
+Product Category: Men's Clothing
+Product Details:
+- Hooded: No
+- Sleeve: Full Sleeve
+- Reversible: No
+- Fabric: Cotton
+- Neck: V-neck
+- Pattern: Striped
+- Occasion: Casual
+- Ideal For: Men's
+- Style Code: 15001E
+---
+Prompt:
+userContext: Online shopping for Kids' ClothingQuestion: I'm looking for a cute green jumpsuit for my daughter, something casual and comfortable. Any recommendations?model
+Output:
+You might like the Kothari Girl's Jumpsuit, it's green, sleeveless, and made of cotton.
+ Product Name: Kothari Girl's Jumpsuit
+Product Category: Kids' Clothing
+Product Details:
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Girl's
+- Sleeve: Sleeveless
+- Fabric: Cotton
+- Neck: Round Neck
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a women's top that's casual and sleeveless. Is there anything with a printed pattern?model
+Output:
+The Shilpkala Casual Sleeveless Printed Women's Top is a great option, it's a light blue top with a printed pattern and a round neck.
+ Product Name: Shilpkala Casual Sleeveless Printed Women's Top
+Product Category: Women's Clothing
+Product Details:
+- Ideal For: Women's
+- Occasion: Casual
+- Pattern: Printed
+- Fabric: Crepe
+- Neck: Round Neck
+- Sleeve: Sleeveless
+- Belt Included: No
+- Number of Contents in Sales Package: Pack of 1
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a comfortable, casual top with 3/4 sleeves. What's a good option?model
+Output:
+The Tapyti Casual 3/4 Sleeve Solid Women's Top is a great choice! It's made of polyester and has a round neck.
+ Product Name: Tapyti Casual 3/4 Sleeve Solid Women's Top
+Product Category: Women's Clothing
+Product Details:
+- Ideal For: Women's
+- Occasion: Casual
+- Pattern: Solid
+- Type: Tops
+- Fabric: Polyester
+- Neck: Round
+- Sleeve: 3/4 Sleeve
+- Number of Contents in Sales Package: Pack of 1
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a comfortable and stylish top for everyday wear. What's a good option?model
+Output:
+The esoft Casual Sleeveless Solid Women's Top is a great choice, it's made of rayon and has a round neck.
+ Product Name: esoft Casual Sleeveless Solid Women's Top
+Product Category: Women's Clothing
+Product Details:
+- Sleeve: Sleeveless
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Rayon
+- Type: Top
+- Neck: Round Neck
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Casual
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a casual, reversible shirt for women, any suggestions?model
+Output:
+The Antilia Femme Women's Solid Casual Reversible Shirt is a great option, it's available in blue and comes in a pack of one.
+ Product Name: Antilia Femme Women's Solid Casual Reversible Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Fabric: Viscose
+- Fit: Regular
+- Style Code: UFOSH005405
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a comfortable, wire-free bra for everyday wear. What are some good options that are brown or black?model
+Output:
+The S4S Stylish Women's Push-up Bra is a good option. It's wire-free, made of cotton, and comes in brown and black.
+ Product Name: S4S Stylish Women's Push-up Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Brown, Black
+- color: Brown, Black
+- Pattern: Printed
+- Occasion: Casual
+- Ideal For: Women's
+- Inner Lining: Cotton Lining
+- Wire Support: Wirefree
+- Detachable Straps: No
+- Straps: Regular
+- Number of Contents in Sales Package: Pack of 2
+- Cup Type: Non Padded
+- Fabric: Cotton
+- Type: Push-up Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a comfortable, casual shirt for everyday wear, any suggestions?model
+Output:
+The Tokyo Talkies Women's Printed Casual Shirt is a good choice, it's made of 100% polyester and has a regular fit.
+ Product Name: Tokyo Talkies Women's Printed Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Printed
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Fabric: 100% Polyester
+- Collar: Mandarin Collar
+- Fit: Regular
+- Style Code: TTSH000057BAKED CLAY
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a half-sleeve shirt for women with a self design, any suggestions?model
+Output:
+The Tokyo Talkies Women's Self Design Casual Shirt is a half-sleeve shirt with a self design, it's made of 100% polyester and has a mandarin collar.
+ Product Name: Tokyo Talkies Women's Self Design Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Self Design
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Fabric: 100% Polyester
+- Collar: Mandarin Collar
+- Type: Shirt
+- Fit: Regular
+- Style Code: TTSH000083PEACOAT NAVY
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a short, casual dress with a fun print. What are some good options?model
+Output:
+You might like the Just Wow Women's Sheath Dress, it's a mini/short, printed dress with a round neck and half sleeves.
+ Product Name: Just Wow Women's Sheath Dress
+Product Category: Women's Clothing
+Product Details:
+- Length: Mini/Short
+- Pattern: Printed
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Crepe
+- Type: Sheath
+- Neck: Round Neck
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a comfortable, wire-free bra for everyday wear. Is the Clovia Women's T-Shirt Bra a good option?model
+Output:
+The Clovia Women's T-Shirt Bra is wire-free and made of cotton, making it a comfortable choice for everyday wear.
+ Product Name: Clovia Women's T-Shirt Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Pink
+- Age Group: NA - NA month
+- color: Pink
+- Pattern: Solid
+- Ideal For: Women's
+- Wire Support: Wirefree
+- Detachable Straps: No
+- Straps: Regular
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: T-Shirt Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a women's denim shirt that's slim fit and has full sleeves, any suggestions?model
+Output:
+The People Women's Solid Casual Denim Shirt is a slim fit, full sleeve shirt made of denim, it might be what you're looking for.
+ Product Name: People Women's Solid Casual Denim Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Slim Fit
+- Fabric: Denim
+- Fit: Slim
+- Style Code: P20402166050015
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I need a casual, half-sleeve t-shirt for men. What's a good option?model
+Output:
+The Numero Uno Solid Men's Round Neck T-Shirt is a good choice. It's made of cotton, has a slim fit, and is perfect for casual occasions.
+ Product Name: Numero Uno Solid Men's Round Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Slim
+- Pattern: Solid
+- Ideal For: Men's
+- Occasion: Casual
+- Style Code: NMFNHZ447_Green..F
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a jumpsuit that's affordable and has a good style, do you know of any brands that make them?model
+Output:
+Limerence makes a great affordable jumpsuit that's stylish and comes in a variety of colors.
+ Product Name: Limerence Solid Women's Jumpsuit
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Ideal For: Women's
+- Color: Beige
+- Sleeve: Sleeveless
+- Fabric: Rayon
+- Neck: Round Neck
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a comfortable white shirt for casual wear, any suggestions?model
+Output:
+You might like the Marc N' Park Men's Solid Casual Shirt, it's made of 100% cotton and has a slim fit.
+ Product Name: Marc N' Park Men's Solid Casual Shirt
+Product Category: Men's Clothing
+Product Details:
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Men's
+- Closure: Buttons
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Body Fit
+- Collar: Spread Collar
+- Fabric: 100% Cotton
+- Pockets: 1 Pocket On Chest
+- Placket: Button Down Placket
+- Fit: Slim
+- Hem: Curved Hem
+- Other Details: Please Refer Our Size Chart
+- Style Code: 15052A
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Is the Awesome Silk Embroidered Blouse Material already stitched?model
+Output:
+No, it's unstitched and comes in the form of a blouse material.
+ Product Name: Awesome Silk Embroidered Blouse Material
+Product Category: Women's Clothing
+Product Details:
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Silk
+- Type: Blouse Material
+- Pattern: Embroidered
+- Ideal For: Women's
+- Color: Pink
+- Style Code: AWB1-B08PINK
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Is there a comfortable, red bra that's full coverage and wire-free?model
+Output:
+The S4S Stylish Women's Full Coverage Bra is a wire-free, full coverage bra that comes in red and is made of cotton.
+ Product Name: S4S Stylish Women's Full Coverage Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Red
+- color: Red
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+- Wire Support: Wirefree
+- Detachable Straps: No
+- Straps: Regular
+- Number of Contents in Sales Package: Pack of 1
+- Cup Type: Non Padded
+- Fabric: Cotton
+- Type: Full Coverage Bra
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a comfortable, casual t-shirt for the weekend, something with a bit of style. Any suggestions?model
+Output:
+The Numero Uno Solid Men's Round Neck T-Shirt might be a good option. It's made of cotton, has a slim fit, and comes in a solid color. It's perfect for casual wear and is available online at Flipkart.com.
+ Product Name: Numero Uno Solid Men's Round Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Slim Fit
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Men's
+- Style Code: NMFNFZ129_CORN YELLOW..F
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a comfortable and stylish shirt for everyday wear, what's a good option with a printed design?model
+Output:
+The Tokyo Talkies Women's Printed Casual Shirt is a great choice, it's made of 100% polyester and has a mandarin collar and full sleeves.
+ Product Name: Tokyo Talkies Women's Printed Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Printed
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Collar: Mandarin Collar
+- Fabric: 100% Polyester
+- Fit: Regular
+- Style Code: Tttp000623 White
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: What's a comfortable, knee-length dress made of rayon that I can wear for different occasions?model
+Output:
+The Lady Stark Women's Shift Dress is a great option! It's made of rayon, has a round neck, and is knee-length. It's perfect for casual, festive, and party occasions.
+ Product Name: Lady Stark Women's Shift Dress
+Product Category: Women's Clothing
+Product Details:
+- Length: Midi/Knee Length
+- Pattern: Solid
+- Occasion: Casual, Festive, Party
+- Ideal For: Women's
+- Sleeve: Full Sleeve
+- Fabric: Rayon
+- Type: Shift
+- Neck: Round Neck
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a red bra for a special occasion, but I want something that's comfortable and provides good support. Do you have any recommendations?model
+Output:
+The Lovinoform Latest Comfortable Stylist Non Wired Bra in red might be a good option. It's a non-padded, wire-free bra with regular straps and a hook closure, perfect for special occasions.
+ Product Name: Lovinoform Latest Comfortable Stylist Non Wired Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Red
+- color: Red
+- Ideal For: Women's
+- Occasion: Casual, Festive, Formal, Party, Wedding
+- Wire Support: Wirefree
+- Inner Lining: Cotton Lining
+- Straps: Regular
+- Detachable Straps: Yes
+- Number of Contents in Sales Package: Pack of 1
+- Cup Type: Molded Cups
+- Fabric: Cotton Lycra
+- Type: Non Padded
+- Seam Type: Seamed
+- Style Code: 10003115
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm a fan of the Teenage Mutant Ninja Turtles and need a new t-shirt. What's a good option that's comfortable and has a cool design?model
+Output:
+The Ninja Turtles Printed Men's Round Neck T-Shirt is a great choice, it's made of cotton, has a slim fit, and features a cool print of the turtles on the front.
+ Product Name: Ninja Turtles Printed Men's Round Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Slim Fit
+- Pattern: Printed
+- Ideal For: Men's
+- Occasion: Casual
+- Style Code: FST-1019_Orange..F
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a comfortable, casual t-shirt with a unique design. What are some options with a round neck and half sleeves?model
+Output:
+The Ocean Race Graphic Print Men's Round Neck T-Shirt is a good option. It's made of cotton, has a graphic print, and is perfect for casual wear.
+ Product Name: Ocean Race Graphic Print Men's Round Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Regular
+- Pattern: Graphic Print
+- Occasion: Casual
+- Ideal For: Men's
+- Style Code: OCR-1053
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: What's a good option for a half-sleeve, round-neck sheath dress that's not too expensive?model
+Output:
+The Lady Stark Women's Shift Dress is a great choice! It's a half-sleeve, round-neck sheath dress made from georgette fabric and is available for Rs. 999.
+ Product Name: Lady Stark Women's Shift Dress
+Product Category: Women's Clothing
+Product Details:
+- Length: Midi/Knee Length
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Half Sleeve
+- Fabric: Georgette
+- Type: Shift
+- Neck: Round Neck
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a stylish, casual t-shirt for men. What's a good option with a printed design and half sleeves?model
+Output:
+The Oviyon Printed Men's V-neck T-Shirt is a great choice, it's made of cotton, has a slim fit, and comes in various printed designs.
+ Product Name: Oviyon Printed Men's V-neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: V-neck
+- Fit: Slim
+- Pattern: Printed
+- Ideal For: Men's
+- Occasion: Casual
+- Style Code: OVNGYVNHS04
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Do you have any women's shirts made of chiffon with a solid pattern?model
+Output:
+The Tokyo Talkies Women's Solid Casual Shirt is made of 100% polyester, so it's not chiffon.
+ Product Name: Tokyo Talkies Women's Solid Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Fabric: 100% Polyester
+- Collar: Mandarin Collar
+- Fit: Regular
+- Style Code: TTSH000057BAKED CLAY
+---
+Prompt:
+userContext: Online shopping for Kids' ClothingQuestion: What are some cute baby girl boxer shorts with a floral print?model
+Output:
+BelleGirl Floral Print Baby Girl's Boxer Drawers are a great option, they come in a pack of 3 and are made of 100% cotton.
+ Product Name: BelleGirl Floral Print Baby Girl's Boxer
+Product Category: Kids' Clothing
+Product Details:
+- Pattern: Floral Print
+- Occasion: Casual
+- Ideal For: Baby Girl's
+- Style Code: BLKW_DRAW_LIGH
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need some new casual pants for everyday wear. Do the G Fashion striped cotton harem pants come in a pack or just one pair?model
+Output:
+The G Fashion Striped Cotton Women's Harem Pants are a pack of 1.
+ Product Name: G Fashion Striped Cotton Women's Harem Pants
+Product Category: Women's Clothing
+Product Details:
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Pattern: Striped
+- Ideal For: Women's
+- Occasion: Casual
+---
+Prompt:
+userContext: Online shopping for Kids' ClothingQuestion: What makes the NOQNOQ Girl's Fit and Flare Dress special?model
+Output:
+It's a sleeveless, solid pink dress with a fit and flare style, perfect for casual wear.
+ Product Name: NOQNOQ Girl's Fit and Flare Dress
+Product Category: Kids' Clothing
+Product Details:
+- Length: Midi/Knee Length
+- Pattern: Solid
+- Ideal For: Girl's
+- Occasion: Casual
+- Sleeve: Sleeveless
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Fit and Flare
+- Neck: Round Neck
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Is the Monmione Women's Push-up Bra adjustable?model
+Output:
+The Monmione Women's Push-up Bra has regular straps, but they are not detachable.
+ Product Name: Monmione Women's Push-up Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Red and Black
+- Age Group: NA - NA month
+- color: Red, Black
+- Pattern: Solid
+- Ideal For: Women's
+- Wire Support: Underwire
+- Detachable Straps: No
+- Straps: Regular
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Push-up Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for some leggings that are comfortable enough to wear all day long. Do you have any recommendations?model
+Output:
+The Jazzup Women's Leggings are made of silk and are designed for comfort and a flattering fit. They're perfect for casual wear and can be worn for long periods of time.
+ Product Name: Jazzup Women's Leggings
+Product Category: Women's Clothing
+Product Details:
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Silk
+- Type: Leggings
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Women's
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a stylish green party shirt for women, any recommendations?model
+Output:
+You might like the I Am For You Women's Solid Party Shirt, it's a slim fit, sleeveless top with a V-neck and made of polyester.
+ Product Name: I Am For You Women's Solid Party Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Party
+- Sleeve: Sleeveless
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Slim
+- Fabric: Polyester
+- Collar: V-Neck
+- Fit: Slim
+- Style Code: IMFU-FK-113
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a women's top with a floral print, what are some options that are comfortable and casual?model
+Output:
+The Peppermint Blues Casual Short Sleeve Floral Print Women's Top is a great choice, it's made of rayon and has a round neck.
+ Product Name: Peppermint Blues Casual Short Sleeve Floral Print Women's Top
+Product Category: Women's Clothing
+Product Details:
+- Sleeve: Short Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Rayon
+- Type: Top
+- Neck: Round Neck
+- Pattern: Floral Print
+- Occasion: Casual
+- Ideal For: Women's
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Is the Pratami Cotton Silk Blend Solid Blouse Material available in a pack of 1?model
+Output:
+Yes, the Pratami Cotton Silk Blend Solid Blouse Material is available in a pack of 1.
+ Product Name: Pratami Cotton Silk Blend Solid Blouse Material
+Product Category: Women's Clothing
+Product Details:
+- Dupatta Fabric: NA
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton Silk Blend
+- Top Fabric: Cotton Silk Blend
+- Type: Blouse Material
+- Series: Fashion
+- Bottom Fabric: Cotton Silk Blend
+- Design: None
+- Top Length: 39.37 inch
+- Top Width: 36 inch
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Formal
+- Color: Brown
+- Style Code: PSB0008A
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a comfortable, everyday bra that provides good support, any recommendations?model
+Output:
+The S4S Stylish Women's Push-up Bra is a good choice, it's wire-free and comes in pink.
+ Product Name: S4S Stylish Women's Push-up Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Pink
+- color: Pink
+- Pattern: Solid
+- Ideal For: Women's
+- Occasion: Casual
+- Wire Support: Wirefree
+- Inner Lining: Cotton Lining
+- Straps: Regular
+- Detachable Straps: No
+- Number of Contents in Sales Package: Pack of 1
+- Cup Type: Non Padded
+- Fabric: Cotton
+- Type: Push-up Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a printed dress that's a bit longer in the back. Is the Meiro Women's High Low Dress a good option?model
+Output:
+Yes, the Meiro Women's High Low Dress has a printed pattern and is designed to be longer in the back, providing a unique and stylish touch.
+ Product Name: Meiro Women's High Low Dress
+Product Category: Women's Clothing
+Product Details:
+- Length: Midi/Knee Length
+- Pattern: Printed
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Sleeveless
+- Belt Included: No
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Shell - Polyester Crepe, Slip - Polyester Spandex
+- Type: High Low
+- Neck: Round Neck
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: What's a good quality salwar suit dupatta material that's not too expensive?model
+Output:
+The Sinina Cotton Polyester Blend Solid Salwar Suit Dupatta Material is a great option, it's made of a cotton polyester blend and is priced at Rs. 699.
+ Product Name: Sinina Cotton Polyester Blend Solid Salwar Suit Dupatta Material
+Product Category: Women's Clothing
+Product Details:
+- Fabric: Cotton Polyester Blend
+- Type: Salwar Suit Dupatta Material
+- Pattern: Solid
+- Ideal For: Women's
+- Color: Green
+- Style Code: 112Tangy5003
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a pack of 3 skinny jeans for women, what are some good options?model
+Output:
+Vero Moda makes a solid skinny fit women's jeans in a pack of 3, they're made of cotton and have a mid rise.
+ Product Name: Vero Moda Solid Women's Jeans
+Product Category: Women's Clothing
+Product Details:
+- Number of Contents in Sales Package: Pack of 3
+- Brand Fit: skinny
+- Fabric: Cotton
+- Rise: Mid Rise
+- Wash: Regular
+- Pattern: Solid
+- Ideal For: Women's
+- Style Code: 101468211115-BL-024
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a maxi dress with a high slit, what are some options?model
+Output:
+The Vero Moda Women's Maxi Dress has a high slit and is available in blue.
+ Product Name: Vero Moda Women's Maxi Dress
+Product Category: Women's Clothing
+Product Details:
+- Length: Maxi/Full Length
+- Pattern: Printed
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Sleeveless
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Polyester
+- Type: Maxi
+- Neck: Round Neck
+- Other Details: Button Closure at Front, 2 Slant Pockets
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I want to buy a shirt that I can pair with a skirt and heels for a work event. What's a good option?model
+Output:
+The Aussehen Women's Striped Formal Shirt is a great choice, it's a classic style with a point collar and button closure.
+ Product Name: Aussehen Women's Striped Formal Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Striped
+- Ideal For: Women's
+- Occasion: Formal
+- Sleeve: Sleeveless
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Georgette
+- Collar: Point Collar
+- Fit: Regular
+- Style Code: AUC 5138
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a comfortable, stylish polo shirt for casual wear. What's a good option that comes in a pack of 3?model
+Output:
+The Nimya Solid Men's Polo Neck T-Shirt is a great choice, it's made of cotton, has a ribbed collar, and comes in a pack of 3 with different colors.
+ Product Name: Nimya Solid Men's Polo Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 3
+- Collar: Ribbed Collar
+- Fabric: Cotton
+- Type: Polo Neck
+- Fit: Regular
+- Pattern: Solid
+- Ideal For: Men's
+- Occasion: Casual
+- Style Code: NMP241MBG
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a comfortable bra that's not too fancy for everyday wear. Can you recommend something?model
+Output:
+The S4S Printed Women's Full Coverage Bra is a good choice. It's made of cotton and has a simple, printed design.
+ Product Name: S4S Printed Women's Full Coverage Bra
+Product Category: Women's Clothing
+Product Details:
+- Brand Color: Blue
+- color: Blue
+- Pattern: Printed
+- Ideal For: Women's
+- Occasion: Casual
+- Wire Support: Wirefree
+- Inner Lining: Cotton Lining
+- Straps: Regular
+- Detachable Straps: No
+- Number of Contents in Sales Package: Pack of 1
+- Cup Type: Non Padded
+- Fabric: Cotton
+- Type: Full Coverage Bra
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I need a dress for a casual outing, something with an animal print and a short length.model
+Output:
+The Miss Chase Women's Shift Dress is a good option, it's a mini/short, animal print dress with a V-neck and sleeveless design.
+ Product Name: Miss Chase Women's Shift Dress
+Product Category: Women's Clothing
+Product Details:
+- Length: Mini/Short
+- Pattern: Animal Print
+- Ideal For: Women's
+- Occasion: Casual
+- Sleeve: Sleeveless
+- Fabric: Crepe
+- Neck: V-Neck
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Hey Google, I'm looking for a cute striped shirt for casual wear, any recommendations?model
+Output:
+You might like the Tokyo Talkies Women's Striped Casual Shirt, it's 100% cotton and has a mandarin collar.
+ Product Name: Tokyo Talkies Women's Striped Casual Shirt
+Product Category: Women's Clothing
+Product Details:
+- Pattern: Striped
+- Occasion: Casual
+- Ideal For: Women's
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Brand Fit: Regular
+- Collar: Mandarin Collar
+- Fabric: 100% Cotton
+- Type: Shirt
+- Fit: Regular
+- Style Code: TTSH000025DARK GREY
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for a comfortable, casual t-shirt for men. Any recommendations?model
+Output:
+The Numero Uno Solid Men's Round Neck T-Shirt is a good option, it's made of cotton, has a slim fit, and is perfect for casual occasions.
+ Product Name: Numero Uno Solid Men's Round Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Full Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Slim
+- Pattern: Solid
+- Occasion: Casual
+- Ideal For: Men's
+- Style Code: NMFNFZ129_Grey..F
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I want a comfortable, casual t-shirt for men that's also stylish, any suggestions?model
+Output:
+The Numero Uno Printed Men's Round Neck T-Shirt is a good choice, it's made of cotton, has a slim fit, and comes in a variety of prints.
+ Product Name: Numero Uno Printed Men's Round Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Cotton
+- Type: Round Neck
+- Fit: Slim
+- Pattern: Printed
+- Occasion: Casual
+- Ideal For: Men's
+- Style Code: NMFNHZ474_CYAN BLUE..F
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: I'm looking for a pair of skinny jeans for women, are there any that come in a pack of two?model
+Output:
+The GALLOWAY skinny Fit Women's Jeans come in a pack of two and are a good option for a skinny fit.
+ Product Name: GALLOWAY skinny Fit Women's Jeans
+Product Category: Women's Clothing
+Product Details:
+- Ideal For: Women's
+- Pattern: Solid
+- Fabric: satin , denim
+- Brand Fit: skinny
+- Rise: Mid Rise
+- Number of Contents in Sales Package: Pack of 2
+- Style Code: GALL_1016+17_30
+---
+Prompt:
+userContext: Online shopping for Kids' ClothingQuestion: Is this kea Girl's Vest sleeveless?model
+Output:
+Yes, it's sleeveless.
+ Product Name: kea Girl's Vest
+Product Category: Kids' Clothing
+Product Details:
+- Sleeve: Sleeveless
+- Number of Contents in Sales Package: Pack of 1
+- Fabric: Imported Viscose Jersey
+- Neck: Round Neck
+- Pattern: Solid
+- Ideal For: Girl's
+- Other Details: Grip Fit Style
+- Style Code: kea white vest
+---
+Prompt:
+userContext: Online shopping for Men's ClothingQuestion: I'm looking for comfortable polo shirts for casual wear, any recommendations for a multi-pack?model
+Output:
+Candy House Solid Men's Polo Neck T-Shirt is a good option, it comes in a pack of 4 and is made of a comfortable polycotton blend.
+ Product Name: Candy House Solid Men's Polo Neck T-Shirt
+Product Category: Men's Clothing
+Product Details:
+- Sleeve: Half Sleeve
+- Number of Contents in Sales Package: Pack of 4
+- Fabric: POLYCOTTON
+- Type: Polo Neck
+- Pockets: NO
+- Fit: Regular
+- Pattern: Solid
+- Ideal For: Men's
+- Occasion: Casual
+- Style Code: POLO4_RBLURN_YLO_GRY
+---
+Prompt:
+userContext: Online shopping for Women's ClothingQuestion: Is there a multi-purpose fabric that I can use for both a saree and a blouse?model
+Output:
+Yes, the Neon Cotton Embroidered Multi-purpose Fabric is perfect for both sarees and blouses. It's a cotton fabric with an embroidered pattern, and it comes with a blouse piece.
+ Product Name: Neon Cotton Embroidered Multi-purpose Fabric
+Product Category: Women's Clothing
+Product Details:
+- Fabric: Cotton
+- Type: Multi-purpose Fabric
+- Pattern: Embroidered
+- Ideal For: Women's
+- Color: Pink
+- Style Code: NAD41
+---
+Prompt:
+