Add vertex persistent resource to settings for step operator #3304

Status: Open · wants to merge 4 commits into base: develop
28 changes: 28 additions & 0 deletions docs/book/component-guide/step-operators/vertex.md
@@ -136,3 +136,31 @@ For more information and a full list of configurable attributes of the Vertex st
Note that if you wish to use this step operator to run steps on a GPU, you will need to follow [the instructions on this page](../../how-to/pipeline-development/training-with-gpus/README.md) to ensure that it works. This requires some extra settings customization and is essential for enabling CUDA so that the GPU can deliver its full acceleration.

<figure><img src="https://static.scarf.sh/a.png?x-pxid=f0b4f458-0a54-4fcd-aa95-d5ee424815bc" alt="ZenML Scarf"><figcaption></figcaption></figure>

#### Using Persistent Resources for Faster Development

When developing ML pipelines that use Vertex AI, the startup time for each CustomJob can be significant since Vertex needs to provision new compute resources for each run. To speed up development iterations, you can use Vertex AI's [Persistent Resources](https://cloud.google.com/vertex-ai/docs/training/persistent-resource-overview) feature, which keeps compute resources warm between runs.

To use persistent resources with the Vertex step operator, you can configure it either when registering the step operator or through the step settings:

```python
from zenml import step
from zenml.integrations.gcp.flavors.vertex_step_operator_flavor import (
    VertexStepOperatorSettings,
)

@step(step_operator=<STEP_OPERATOR_NAME>, settings={"step_operator": VertexStepOperatorSettings(
    persistent_resource_id="my-persistent-resource",  # specify your persistent resource ID
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)})
def trainer(...) -> ...:
    """Train a model."""
    # This step will run on the persistent resource and start faster
```

This is particularly useful when:
* You're developing locally and want to iterate quickly on steps that need GPU/TPU resources
* You have a local orchestrator but want to leverage Vertex AI for specific compute-intensive steps

{% hint style="warning" %}
Remember that persistent resources continue to incur costs as long as they're running, even when idle. Make sure to monitor your usage and configure appropriate idle timeout periods.
{% endhint %}
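
The same values can also be supplied through a run configuration file instead of the Python settings object. The following is a sketch; the `step_operator` settings key and the step name `trainer` are assumptions based on the example above, not values confirmed by this PR:

```yaml
steps:
  trainer:
    settings:
      step_operator:
        persistent_resource_id: my-persistent-resource
        machine_type: n1-standard-4
        accelerator_type: NVIDIA_TESLA_T4
        accelerator_count: 1
```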
@@ -12,7 +12,7 @@ A code repository in ZenML refers to a remote storage location for your code. So

Code repositories enable ZenML to keep track of the code version that you use for your pipeline runs. Additionally, running a pipeline that is tracked in a registered code repository can [speed up the Docker image building for containerized stack components](../../infrastructure-deployment/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md) by eliminating the need to rebuild Docker images each time you change one of your source code files.

- Learn more about how code repositories benefit development [here](../../infrastructure-deployment/customize-docker-builds/use-code-repositories-to-speed-up-docker-build-times.md).
+ Learn more about how code repositories benefit development [here](../../customize-docker-builds/how-to-reuse-builds.md).

## Registering a code repository

@@ -51,14 +51,16 @@ class VertexStepOperatorSettings(BaseSettings):
https://cloud.google.com/vertex-ai/docs/training/configure-compute#boot_disk_options
boot_disk_type: Type of the boot disk. (Default: pd-ssd)
https://cloud.google.com/vertex-ai/docs/training/configure-compute#boot_disk_options

persistent_resource_id: The ID of the persistent resource to use for the job.
https://cloud.google.com/vertex-ai/docs/training/persistent-resource-overview
"""

accelerator_type: Optional[str] = None
accelerator_count: int = 0
machine_type: str = "n1-standard-4"
boot_disk_size_gb: int = 100
boot_disk_type: str = "pd-ssd"
persistent_resource_id: Optional[str] = None


class VertexStepOperatorConfig(
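
For reference, the new field is purely additive: existing configurations keep working because `persistent_resource_id` defaults to `None`. A stdlib-only sketch of the settings shape (the real class is a pydantic `BaseSettings`; a dataclass stands in for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VertexStepOperatorSettingsSketch:
    """Illustrative stand-in mirroring the fields of the settings class above."""

    accelerator_type: Optional[str] = None
    accelerator_count: int = 0
    machine_type: str = "n1-standard-4"
    boot_disk_size_gb: int = 100
    boot_disk_type: str = "pd-ssd"
    persistent_resource_id: Optional[str] = None  # new field: None keeps old behavior


# Omitting the new field preserves the previous defaults; opting in is one field.
default = VertexStepOperatorSettingsSketch()
warm = VertexStepOperatorSettingsSketch(persistent_resource_id="my-persistent-resource")
```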
@@ -265,6 +265,7 @@ def launch(
}
if self.config.encryption_spec_key_name
else {},
"persistent_resource_id": settings.persistent_resource_id,
}
logger.debug("Vertex AI Job=%s", custom_job)
