-
Notifications
You must be signed in to change notification settings - Fork 464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DISCUSSION] Broader SageMaker Training API support in Step Operator #1398
Comments
@christianversloot Very interesting. Let me get the team together to discuss this and we'll write a more thoughtful response. In general, would love to receive a PR to enable more support for Sagemaker. Btw, have you taken a look at the Sagemaker Orchestrator as well, or would you want to stick to Sagemaker step operators? In the latter case, what orchestrator are you using? |
Thanks, looking forward to reading your thoughts! We are in the very early stages of ZenML adoption - installed ZenML and created a local orchestrator and S3 based artifact store for now. We follow the 'local if possible, cloud if necessary' paradigm since we're also working with smaller-data traditional ML approaches which often don't require cloud-based instances and can run on a local orchestrator. We are thinking about Sagemaker orchestration for our pipelines, but may also choose for (self-hosted) Airflow to let our core remain independent from one of the major cloud vendors. Then, the step operators are very useful should we want to run specific steps with a specific cloud vendor (although we're primarily focused on AWS for ML for now). |
Makes sense! Let me now answer some of your questions in-line. Still brainstorming internally but would love to get you looped in:
Gut feeling is preferably it would be one step operator that takes different configurations that then define the internals. For example, I can pass However, then the question is: Is there a common configuration set that spans across all these different types of operators, and how can one configure the individual operators that have specific configuration (I imagine AutoMLStepOperator has different config than TensorflowStepOperator. We had a similar problem with the AirflowOrchestrator where airflow can also run on different 'operators'. The way we solved it is to have the
If we go with one flavor, then I think we can get away without this complexity hidden away.
I feel like this is a bit question that we need to addess with another abstraction in ZenML. Currently we have the SparkStepOperator that allows for massive data processing jobs but tbh we need to flesh this out with other ideas. Ideally, we'd have a |
@christianversloot Eager to hear thoughts! |
I think it would be a good idea to apply a similar solution as you've used for the The solution that we are implementing for expanding the current step operator is already moving into that direction by allowing users to specify everything in Unfortunately, in SageMaker, not all estimators, tuners, etc, share the same way of working. Some rely more on keywords arguments; others more on regular arguments. For example:
Maybe, we can solve this in a generic way by allowing users to pass instances of
That way, there is a generic class structure for specifying settings. Then, depending on the settings object's instance type, a specific implementation of the In the various settings sub classes, we can specify the required arguments for each SageMaker functionality being requested (i.e. the varying amounts of regular arguments), and allow all keyword arguments to be passed as a dict: In other words, configuration is then done in the pipeline (at step level), as that's where I'd prefer it to be:
That way, maybe it is no longer necessary to specify the specific type of operator during registration ( Hope this makes sense? |
@christianversloot Now had the time to digest and debrief also with @schustmi and @strickvl internally. I think we agree on the overall solution of implementing this all inside a single step flavor for sure. However, the proposal of passing different settings classes doesn't work with pydantic we think. Instead, a solution would be to have those "operator-specific" configurations as an attribute on the SagemakerStepOperatorSettings:
That might make the most sense but maybe it would make it hard to make the base class extensible to future implementation. However, we can always figure this out when we get there. If you guys are up for it, we're happy to help with this extension to ZenML. Would you have time/capacity to create a PR for it? We can jump on it asap to get it into the latest possible release! I think it would be a very impactful community driven feature |
I would agree with the Design-wise, do I understand correctly that below is an approach that we all agree on (except some details, maybe, but primarily the generics)?
Time-wise, I do have approximately 1 day per week available for setting up the ZenML stack and its components. A lot is already running and we are now in the process of adding the first components and creating the first stacks. This means that there will be some time for me to start working on broader API support, especially because it's very likely that we're going to need it. Also, given the design (when we all consider it to be final), I don't think the changes will be very complex, so we should be able to make some nice progress. Perhaps, release wise, break it down into smaller steps (e.g. step A being the regular Estimator following the new setup, step B the TensorFlow estimator, etc.) so that if needed they can be released individually and some steps can be released during the next release? |
@schustmi Do you agree with that summary? If yes, @christianversloot , lets have a 30 minute discussion about how to break this down in a public hellonext ticket and track it out? |
Sure! Perhaps we can schedule via Slack. I'll take a look on Monday :) |
Awesome! |
Looking forward to hearing what the next steps are 😊 |
@christianversloot Reached out via Slack. if somebody else is on this thread in the future, please let us know here and we'll loop you in! |
For those who read this at a later date, the design doc for this extension is found here: https://zenml.notion.site/Design-Document-Template-for-SageMakerStepOperator-Extension-in-ZenML-56a4ecda560c413eb9f70dd2d4c3c41a |
In #1381, we are expanding the SageMaker Step Operator with additional estimator arguments and training data inputs.
Currently, the SageMaker Step Operator in ZenML uses a default Estimator for starting a Training Job. However, there is a broader range of Training APIs available in SageMaker: https://sagemaker.readthedocs.io/en/stable/api/training/index.html
In my view, and at least on our side I can foresee some use cases (especially for data processing steps in ZenML) where some of these Training APIs are needed:
This would be quite a big feature. Before just shooting in a PR, I have a few questions that we should perhaps discuss first. I'm not sure whether an issue is the correct place for that, but I read somewhere that "before making a massive PR make an issue for discussion first", so here it is :)
import
structure insagemaker_step_operator.py
andsagemaker_step_operator_flavor.py
. The first imports the flavor classes from the second; the second imports the implementation class from the first. If we'd have multiple implementation classes (one for each Training API I'd say)? A possible structure could be to have theSagemakerStepOperator
import the flavors and based on the config pick the right one, where the flavors implement the Training API implementations which also inherit fromBaseStepOperator
. But that feels off; perhaps splitting the Training APIs over multiple SageMaker step operators would be simpler and less complex for the user ('just pick the right step operator') instead of ('pick the step operator, find how I need to configure it to use insert Training API, then find out how I need to add specific configuration for that training API').I'd love to hear your thoughts :)
Checks
The text was updated successfully, but these errors were encountered: