feat: gen ai tuning and eval sample #1628

Merged
merged 30 commits on Jan 27, 2025
Changes from 2 commits
Commits
30 commits
8bd76b5
add tuning and eval samples
willisc7 Jan 16, 2025
0361647
update author and codeowners
willisc7 Jan 16, 2025
5f50a4a
Update gemini/sample-apps/genai-mlops-tune-and-eval/README.md
willisc7 Jan 17, 2025
9d16dfd
Update gemini/sample-apps/genai-mlops-tune-and-eval/README.md
willisc7 Jan 17, 2025
930466f
Update gemini/sample-apps/genai-mlops-tune-and-eval/README.md
willisc7 Jan 17, 2025
a75f685
Update gemini/sample-apps/genai-mlops-tune-and-eval/pipeline.py
willisc7 Jan 17, 2025
34c17ef
Formatting/Spelling
holtskinner Jan 17, 2025
09ffdbc
Sort allowlist
holtskinner Jan 17, 2025
581c124
Update gemini/sample-apps/genai-mlops-tune-and-eval/local/pipeline.py
willisc7 Jan 17, 2025
86c4b66
Update gemini/sample-apps/genai-mlops-tune-and-eval/README.md
willisc7 Jan 17, 2025
10225d9
check-spelling updates to README
willisc7 Jan 17, 2025
e90d769
change glucose example bucket
willisc7 Jan 17, 2025
f877807
moved to tuning directory
willisc7 Jan 17, 2025
858db02
use GenerationConfig
willisc7 Jan 17, 2025
24907f8
insert placeholders
willisc7 Jan 17, 2025
4e76e71
Merge branch 'main' into main
holtskinner Jan 21, 2025
0de450a
remove local examples
willisc7 Jan 21, 2025
c6603b5
fixing lint errors
willisc7 Jan 22, 2025
157eab1
fixing linting errors
willisc7 Jan 22, 2025
b8a8bfe
Formatting/lint errors
holtskinner Jan 23, 2025
f927a85
Merge branch 'main' into main
holtskinner Jan 23, 2025
6e5a4fc
add requirements.txt
willisc7 Jan 24, 2025
96d7d78
return pipeline to working after linting changes
willisc7 Jan 24, 2025
a75ecd1
re-add linting changes. works.
willisc7 Jan 24, 2025
b2e0e9a
avoid returning tuple by printing best response and metrics within mo…
willisc7 Jan 24, 2025
c5e1e04
return namedtuple
willisc7 Jan 24, 2025
d46fccb
added logging of returned component values per Gemini's suggestion
willisc7 Jan 24, 2025
8eb202b
Merge branch 'main' into main
holtskinner Jan 27, 2025
e44bf28
Fix lint errors
holtskinner Jan 27, 2025
e804e20
Fix Markdown lint error
holtskinner Jan 27, 2025
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -81,3 +81,4 @@
/generative-ai/vision/use-cases/hey_llm @tushuhei @GoogleCloudPlatform/generative-ai-devrel
/generative-ai/gemini/sample-apps/llamaindex-rag/backend/indexing/ @Lionel-Lim @GoogleCloudPlatform/generative-ai-devrel
/generative-ai/gemini/multimodal-live-api/websocket-demo-app/ @ZackAkil @GoogleCloudPlatform/generative-ai-devrel
/generative-ai/gemini/sample-apps/genai-mlops-tune-and-eval @willisc7 @GoogleCloudPlatform/generative-ai-devrel
5 changes: 5 additions & 0 deletions gemini/sample-apps/genai-mlops-tune-and-eval/.gitignore
@@ -0,0 +1,5 @@
__pycache__
pipeline.json
venv
local/local_outputs
local/venv
91 changes: 91 additions & 0 deletions gemini/sample-apps/genai-mlops-tune-and-eval/README.md
@@ -0,0 +1,91 @@
# GenAI MLOps Tune and Evaluation

Author: [Chris Willis](https://github.com/willisc7)

This tutorial will take you through using Vertex AI Pipelines to automate tuning an LLM and evaluating it against a previously tuned LLM. The example used is an LLM that summarizes a week of glucose values for a diabetes patient.

![Diagram](./diagram.png)

## Optional: Prepare the data
This step is optional because I've already prepared the data in `patient_1_glucose_examples.jsonl`.
* Create a week of glucose sample data for one patient using the following prompt with Gemini:
```
Create a CSV with a week's worth of example glucose values for a diabetic patient. The columns should be date, time, patient ID, and glucose value. Each day there should be timestamps for 7am, 8am, 11am, 12pm, 5pm, and 6pm. Most of the glucose values should be between 70 and 100. Some of the glucose values should be 100-150.
```
* Flatten the CSV into a single line by doing the following (the keyboard shortcuts assume an editor with multi-cursor support, such as VS Code):
  1. Open the CSV
  2. Press Ctrl + A to select all text
  3. Press Alt + Shift + I to place a cursor at the end of each line
  4. Type a literal newline escape (i.e. `\n`) at each cursor
  5. Press Delete to join everything into a single line
* Copy `glucose_examples_template.jsonl` to `patient_X_glucose_examples.jsonl`
* Copy the flattened CSV and paste it into `patient_X_glucose_examples.jsonl`
* Flatten the contents of the `patient_X_glucose_examples.jsonl` file using an online JSON to JSONL converter, or script the whole preparation as shown in the sketch below
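
If you would rather script these flattening and conversion steps, here is a minimal sketch (it is not part of this sample). It assumes the Gemini-generated CSV is saved as `patient_1_glucose.csv`; the file names and the expected summary are placeholders to adjust for your own data.

```
import json

# Placeholders: adjust the CSV path and the expected summary for your own data.
SYSTEM_INSTRUCTION = (
    "You are an expert in diabetes care that can provide expert summaries of "
    "glucose trends to diabetes patients."
)
USER_PREFIX = (
    "Create a high-level summary of the glucose values you are given in the "
    "following CSV file that can be easily understood by a diabetes patient.\n\n"
)
# Replace with the analysis the tuned model should learn to produce.
EXPECTED_SUMMARY = "Most of the glucose values are in the normal range of 50 to 99 mg/dL."

with open("patient_1_glucose.csv") as f:
    csv_text = f.read()

fence = "`" * 3  # Markdown code fence, built this way so it does not close this README's fence
user_text = USER_PREFIX + fence + "csv\n" + csv_text + fence

example = {
    "systemInstruction": {"role": "system", "parts": [{"text": SYSTEM_INSTRUCTION}]},
    "contents": [
        {"role": "user", "parts": [{"text": user_text}]},
        {"role": "model", "parts": [{"text": EXPECTED_SUMMARY}]},
    ],
}

# json.dumps emits a single line, so each example becomes one JSONL row.
with open("patient_1_glucose_examples.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

The resulting line mirrors the `systemInstruction`/`contents` structure used in `patient_1_glucose_examples.jsonl`.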

## Set Up IAM, Tuning Examples, and Vertex AI Pipelines
* Grant the default Compute Engine service account the required IAM roles
```
PROJECT_NUMBER=$(gcloud projects describe $(gcloud config get-value project) --format="value(projectNumber)")
SERVICE_ACCOUNT="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

gcloud projects add-iam-policy-binding $PROJECT_NUMBER \
--member="serviceAccount:${SERVICE_ACCOUNT}" \
--role="roles/aiplatform.user"
gcloud projects add-iam-policy-binding $PROJECT_NUMBER \
--member="serviceAccount:${SERVICE_ACCOUNT}" \
--role="roles/storage.objectUser"
```
* Create a GCS bucket and upload the JSONL with the glucose and analysis examples to tune the model:
```
gsutil mb gs://glucose-test-bucket-$(date +%Y%m%d)
gsutil cp patient_1_glucose_examples.jsonl gs://glucose-test-bucket-<DATETIME>
```
* Create the pipeline root bucket
```
gsutil mb gs://vertex-ai-pipeline-root-$(date +%Y%m%d)
```

## Run Vertex AI Pipelines
* Install required packages and compile the pipeline
```
python3 -m venv venv
source venv/bin/activate
pip install kfp google-cloud-aiplatform
kfp dsl compile --py pipeline.py --output pipeline.json
```
* Edit `pipeline.py` and change the following:
* `project` - change to your project ID
* `train_data_uri` - change to `gs://glucose-test-bucket-<DATETIME>/patient_1_glucose_examples.jsonl`
* Edit `submit_pipeline_job.py` (a sketch of this script is shown at the end of this section) and change the following:
* `pipeline_root` - change to the `gs://vertex-ai-pipeline-root-<DATETIME>` bucket you created earlier
* `project` - change to your project ID
* `train_data_uri` - change to `gs://glucose-test-bucket-<DATETIME>/patient_1_glucose_examples.jsonl`
* Create the pipeline run
```
python submit_pipeline_job.py
```
* For subsequent runs, change `baseline_model_endpoint` in `pipeline.py` to a tuned model endpoint you want to compare against (typically the previously tuned endpoint)
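
`submit_pipeline_job.py` is not included in this two-commit view, so the following is only a minimal sketch of what such a submission script can look like. The project ID, display name, and bucket names are placeholders, and `parameter_values` maps onto the parameters of `gemini_tuning_pipeline` in `pipeline.py`.

```
from google.cloud import aiplatform

aiplatform.init(
    project="your-project-id",  # placeholder: your project ID
    location="us-central1",
    staging_bucket="gs://vertex-ai-pipeline-root-<DATETIME>",
)

job = aiplatform.PipelineJob(
    display_name="genai-mlops-tune-and-eval",
    template_path="pipeline.json",  # produced by `kfp dsl compile`
    pipeline_root="gs://vertex-ai-pipeline-root-<DATETIME>",
    parameter_values={
        "project": "your-project-id",
        "train_data_uri": "gs://glucose-test-bucket-<DATETIME>/patient_1_glucose_examples.jsonl",
        # For subsequent runs you can also override "baseline_model_endpoint" here.
    },
)

job.submit()  # use job.run() instead to block until the pipeline finishes
```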

## Optional: Run Locally Using Kubeflow Pipelines
This step is optional because you can run the pipeline in Vertex AI Pipelines. However, if you're going to take this pipeline and develop on top of it, it's easier and faster to run the pipeline locally using Kubeflow.
* Install required packages
```
cd ./local
python3 -m venv venv
source venv/bin/activate
pip install kfp google-cloud-aiplatform vertexai plotly google-cloud-aiplatform[evaluation]
```
* Create a local Docker image with your gcloud Application Default Credentials (ADC) copied in, using Podman or the Docker CLI (**IMPORTANT:** never push this container to a public repo)

```
gcloud auth application-default login
cp $HOME/.config/gcloud/application_default_credentials.json .
podman build -t python-3.9-gcloud .

rm application_default_credentials.json
```
* Edit `pipeline.py` and change the following:
* `project` - change to your project ID
* `train_data_uri` - change to `gs://glucose-test-bucket-<DATETIME>/patient_1_glucose_examples.jsonl`
* Create the pipeline run
```
python pipeline.py
```
* For subsequent runs, change `baseline_model_endpoint` in `pipeline.py` to a tuned model endpoint you want to compare against (typically the previously tuned endpoint)
6 changes: 6 additions & 0 deletions gemini/sample-apps/genai-mlops-tune-and-eval/local/Dockerfile
@@ -0,0 +1,6 @@
FROM python:3.9

WORKDIR /app
COPY ./application_default_credentials.json /app/credentials.json
ENV GOOGLE_APPLICATION_CREDENTIALS=/app/credentials.json
ENV GOOGLE_CLOUD_PROJECT=genai-mlops-tune-and-eval
242 changes: 242 additions & 0 deletions gemini/sample-apps/genai-mlops-tune-and-eval/local/pipeline.py
@@ -0,0 +1,242 @@
from kfp import dsl, local

local.init(runner=local.DockerRunner())


@dsl.component(
    packages_to_install=["google-cloud-aiplatform", "vertexai"],
    base_image="localhost/python-3.9-gcloud:latest",
)
def gemini_tuning_component(
    project: str,
    location: str,
    source_model: str,
    train_dataset_uri: str,
) -> str:  # Output the tuned model endpoint name as a string
    import time

    import vertexai
    from vertexai.tuning import sft

    vertexai.init(project=project, location=location)

    tuned_model_display_name = f"tuned-{source_model}-{int(time.time())}"

    sft_tuning_job = sft.train(
        source_model=source_model,
        train_dataset=train_dataset_uri,
        tuned_model_display_name=tuned_model_display_name,
    )

    # Poll until the tuning job finishes, then return the tuned model endpoint.
    while not sft_tuning_job.has_ended:
        time.sleep(60)
        sft_tuning_job.refresh()

    return sft_tuning_job.tuned_model_endpoint_name

@dsl.component(
    packages_to_install=["google-cloud-aiplatform[evaluation]", "bigframes"],
    base_image="localhost/python-3.9-gcloud:latest",
)
def model_comparison_component(
    project: str,
    location: str,
    baseline_model_endpoint: str,  # Baseline model endpoint
    candidate_model_endpoint: str,  # Candidate model endpoint
):
    import functools
    from functools import partial
    import uuid

    from google.cloud import aiplatform
    import pandas as pd
    from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples
    from vertexai.generative_models import GenerationConfig, GenerativeModel

    experiment_name = "qa-quality"

    def pairwise_greater(
        instructions: list,
        context: str,
        project: str,
        location: str,
        experiment_name: str,
        baseline: str,
        candidate: str,
    ) -> tuple:
        """
        Takes the instructions, context, and two different responses.
        Returns the response which best matches the instructions/context for the given
        quality metric (in this case question answering).
        More details on the web API and the other quality metrics this function
        can be extended to are available at
        https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/evaluation
        """
        eval_dataset = pd.DataFrame(
            {
                "instruction": [instructions],
                "context": [context],
                "response": [candidate],
                "baseline_model_response": [baseline],
            }
        )

        eval_task = EvalTask(
            dataset=eval_dataset,
            metrics=[
                MetricPromptTemplateExamples.Pairwise.QUESTION_ANSWERING_QUALITY,
            ],
            experiment=experiment_name,
        )
        results = eval_task.evaluate(
            prompt_template="{instruction} \n {context}",
            experiment_run_name="gemini-qa-pairwise-" + str(uuid.uuid4()),
        )
        result = results.metrics_table[
            [
                "pairwise_question_answering_quality/pairwise_choice",
                "pairwise_question_answering_quality/explanation",
            ]
        ].to_dict("records")[0]
        choice = (
            baseline
            if result["pairwise_question_answering_quality/pairwise_choice"] == "BASELINE"
            else candidate
        )
        return (choice, result["pairwise_question_answering_quality/explanation"])

    def greater(cmp: callable, a: str, b: str) -> int:
        """
        A comparator which applies the pairwise comparison function `cmp` to two responses
        and returns 1 if the first response wins and -1 otherwise, in the form expected by
        functools.cmp_to_key.
        """
        choice, _explanation = cmp(a, b)

        if choice == a:
            return 1
        return -1

    def pointwise_eval(
        instruction: str,
        context: str,
        responses: list[str],
        eval_metrics: list[object] = [
            MetricPromptTemplateExamples.Pointwise.QUESTION_ANSWERING_QUALITY,
            MetricPromptTemplateExamples.Pointwise.GROUNDEDNESS,
        ],
        experiment_name: str = experiment_name,
    ) -> object:
        """
        Takes the instruction, context, and a variable number of corresponding generated
        responses, and returns the pointwise evaluation results for each of the provided
        metrics. For this example the metrics are Q&A related; the full list can be found at
        https://cloud.google.com/vertex-ai/generative-ai/docs/models/online-pipeline-services
        """
        instructions = [instruction] * len(responses)
        contexts = [context] * len(responses)

        eval_dataset = pd.DataFrame(
            {
                "instruction": instructions,
                "context": contexts,
                "response": responses,
            }
        )

        eval_task = EvalTask(
            dataset=eval_dataset, metrics=eval_metrics, experiment=experiment_name
        )
        results = eval_task.evaluate(
            prompt_template="{instruction} \n {context}",
            experiment_run_name="gemini-qa-pointwise-" + str(uuid.uuid4()),
        )
        return results

    def rank_responses(instruction: str, context: str, responses: list[str]) -> tuple:
        """
        Takes the instruction, context, and a variable number of responses as input, and
        returns the best-performing response along with its human-readable pointwise quality
        metrics for the criteria configured in the functions above.
        The process consists of two steps:
        1. Select the best response using pairwise comparisons between the responses for the
           user-specified metric (e.g. Q&A quality).
        2. Run a pointwise evaluation of the best response and return its human-readable
           quality metrics and explanation along with the best response.
        """
        cmp_f = partial(
            pairwise_greater, instruction, context, project, location, experiment_name
        )
        cmp_greater = partial(greater, cmp_f)

        # Pairwise tournament: rank the responses with the pairwise judge and keep the winner.
        pairwise_best_response = max(responses, key=functools.cmp_to_key(cmp_greater))
        pointwise_metric = pointwise_eval(instruction, context, [pairwise_best_response])
        qa_metrics = pointwise_metric.metrics_table[
            [
                col
                for col in pointwise_metric.metrics_table.columns
                if ("question_answering" in col) or ("groundedness" in col)
            ]
        ].to_dict("records")[0]

        return pairwise_best_response, qa_metrics

    # Compare the baseline model's response to the candidate model's response to see which is better
    baseline_model = GenerativeModel(
        baseline_model_endpoint,
        generation_config=GenerationConfig(temperature=0.4, max_output_tokens=512),
    )
    candidate_model = GenerativeModel(
        candidate_model_endpoint,
        generation_config=GenerationConfig(temperature=0.4, max_output_tokens=512),
    )

    instruction_qa = "Analyze the glucose trends in the glucose values provided in the CSV contained in the context. Ensure the analysis you provide can easily be understood by a diabetes patient with no medical expertise."
    context_qa = (
        "Context:\n"
        + "```csv\ndate,time,patient ID,glucose\n2024-11-12,7:00 AM,1,80\n2024-11-12,8:00 AM,1,96\n2024-11-12,11:00 AM,1,90\n2024-11-12,12:00 PM,1,115\n2024-11-12,5:00 PM,1,77\n2024-11-12,6:00 PM,1,80\n2024-11-13,7:00 AM,1,94\n2024-11-13,8:00 AM,1,100\n2024-11-13,11:00 AM,1,87\n2024-11-13,12:00 PM,1,126\n2024-11-13,5:00 PM,1,71\n2024-11-13,6:00 PM,1,82\n2024-11-14,7:00 AM,1,84\n2024-11-14,8:00 AM,1,72\n2024-11-14,11:00 AM,1,96\n2024-11-14,12:00 PM,1,110\n2024-11-14,5:00 PM,1,99\n2024-11-14,6:00 PM,1,74\n2024-11-15,7:00 AM,1,96\n2024-11-15,8:00 AM,1,97\n2024-11-15,11:00 AM,1,99\n2024-11-15,12:00 PM,1,130\n2024-11-15,5:00 PM,1,99\n2024-11-15,6:00 PM,1,87\n2024-11-16,7:00 AM,1,89\n2024-11-16,8:00 AM,1,92\n2024-11-16,11:00 AM,1,77\n2024-11-16,12:00 PM,1,105\n2024-11-16,5:00 PM,1,79\n2024-11-16,6:00 PM,1,90\n2024-11-17,7:00 AM,1,74\n2024-11-17,8:00 AM,1,82\n2024-11-17,11:00 AM,1,74\n2024-11-17,12:00 PM,1,78\n2024-11-17,5:00 PM,1,95\n2024-11-17,6:00 PM,1,74\n2024-11-18,7:00 AM,1,95\n2024-11-18,8:00 AM,1,87\n2024-11-18,11:00 AM,1,79\n2024-11-18,12:00 PM,1,90\n2024-11-18,5:00 PM,1,79\n2024-11-18,6:00 PM,1,77\n"
    )
    prompt_qa = instruction_qa + "\n" + context_qa + "\n\nAnswer:\n"

    baseline_model_response = baseline_model.generate_content(
        contents=prompt_qa,
    )
    candidate_model_response = candidate_model.generate_content(
        contents=prompt_qa,
    )
    responses = [
        baseline_model_response.candidates[0].text,
        candidate_model_response.candidates[0].text,
    ]

    best_response, metrics = rank_responses(instruction_qa, context_qa, responses)

    # Log the generated responses, the winning response, and its pointwise metrics.
    for ix, response in enumerate(responses, start=1):
        print(f"Response no. {ix}: \n {response}")

    print(best_response)
    print(metrics)

@dsl.pipeline
def gemini_tuning_pipeline(
    project: str = "genai-mlops-tune-and-eval",
    location: str = "us-central1",
    source_model_name: str = "gemini-1.5-pro-002",
    train_data_uri: str = "gs://glucose-test-bucket/glucose_examples.jsonl",
    baseline_model_endpoint: str = "projects/824264063118/locations/us-central1/endpoints/797393320253849600",
):
    # Comment out to speed up runs and use past tuned model
    tuning_task = gemini_tuning_component(
        project=project,
        location=location,
        source_model=source_model_name,
        train_dataset_uri=train_data_uri,
    )

    comparison_task = model_comparison_component(
        project=project,
        location=location,
        baseline_model_endpoint=baseline_model_endpoint,
        candidate_model_endpoint=tuning_task.output,
    )


pipeline_task = gemini_tuning_pipeline()
1 change: 1 addition & 0 deletions gemini/sample-apps/genai-mlops-tune-and-eval/patient_1_glucose_examples.jsonl
@@ -0,0 +1 @@
{"systemInstruction":{"role":"system","parts":[{"text":"You are an expert in diabetes care that can provide expert summaries of glucose trends to diabetes patients."}]},"contents":[{"role":"user","parts":[{"text":"Create a high-level summary of the glucose values you are given in the following CSV file that can be easily understood by a diabetes patient.\n\n```csv\ndate,time,patient ID,glucose\n2024-11-12,7:00 AM,1,80\n2024-11-12,8:00 AM,1,96\n2024-11-12,11:00 AM,1,90\n2024-11-12,12:00 PM,1,115\n2024-11-12,5:00 PM,1,77\n2024-11-12,6:00 PM,1,80\n2024-11-13,7:00 AM,1,94\n2024-11-13,8:00 AM,1,100\n2024-11-13,11:00 AM,1,87\n2024-11-13,12:00 PM,1,126\n2024-11-13,5:00 PM,1,71\n2024-11-13,6:00 PM,1,82\n2024-11-14,7:00 AM,1,84\n2024-11-14,8:00 AM,1,72\n2024-11-14,11:00 AM,1,96\n2024-11-14,12:00 PM,1,110\n2024-11-14,5:00 PM,1,99\n2024-11-14,6:00 PM,1,74\n2024-11-15,7:00 AM,1,96\n2024-11-15,8:00 AM,1,97\n2024-11-15,11:00 AM,1,99\n2024-11-15,12:00 PM,1,130\n2024-11-15,5:00 PM,1,99\n2024-11-15,6:00 PM,1,87\n2024-11-16,7:00 AM,1,89\n2024-11-16,8:00 AM,1,92\n2024-11-16,11:00 AM,1,77\n2024-11-16,12:00 PM,1,105\n2024-11-16,5:00 PM,1,79\n2024-11-16,6:00 PM,1,90\n2024-11-17,7:00 AM,1,74\n2024-11-17,8:00 AM,1,82\n2024-11-17,11:00 AM,1,74\n2024-11-17,12:00 PM,1,78\n2024-11-17,5:00 PM,1,95\n2024-11-17,6:00 PM,1,74\n2024-11-18,7:00 AM,1,95\n2024-11-18,8:00 AM,1,87\n2024-11-18,11:00 AM,1,79\n2024-11-18,12:00 PM,1,90\n2024-11-18,5:00 PM,1,79\n2024-11-18,6:00 PM,1,77\n```"}]},{"role":"model","parts":[{"text":"Most of the glucose values are in the normal range of 50 to 99 mg/dL. Several days around 12:00 PM glucose levels were in the high range of 100 to 125 mg/dL or greater."}]}]}
