Project import generated by Copybara. (#38)
GitOrigin-RevId: 58e65003b64918af74ece769567892c98a3f9fbd

Co-authored-by: Snowflake Authors <[email protected]>
snowflake-provisioner and Snowflake Authors authored Aug 31, 2023
1 parent f3a83fb commit 192f794
Showing 159 changed files with 12,013 additions and 3,282 deletions.
39 changes: 37 additions & 2 deletions CHANGELOG.md
@@ -1,6 +1,41 @@
# Release History

-## 1.0.5
+## 1.0.6

### New Features
- Model Registry: Added `create_if_not_exists` parameter to the constructor.
- Model Registry: Added `get_or_create_model_registry` API.
- Model Registry: Added support for using GPU inference when deploying XGBoost (`xgboost.XGBModel` and `xgboost.Booster`), PyTorch (`torch.nn.Module` and `torch.jit.ScriptModule`) and TensorFlow (`tensorflow.Module` and `tensorflow.keras.Model`) models to Snowpark Container Services.
- Model Registry: When inferring a model signature, a `Sequence` of built-in types, `numpy.ndarray`, `torch.Tensor`, or `tensorflow.Tensor` can now be used instead of only a `List` of them.
- Model Registry: Added `get_training_dataset` API.
- Model Development: The size of the metrics result can now exceed the previous 8 MB limit.
- Model Registry: Added support for saving/loading/deploying HuggingFace pipeline objects (`transformers.Pipeline`) and our wrapper (`snowflake.ml.model.models.huggingface_pipeline.HuggingFacePipelineModel`) for them. Use the wrapper to specify configurations; the model for the pipeline will then be loaded dynamically at deployment time. Currently, the following tasks can be logged without manually specifying model signatures (see the sketch after this task list):
- "conversational"
- "fill-mask"
- "question-answering"
- "summarization"
- "table-question-answering"
- "text2text-generation"
- "text-classification" (alias "sentiment-analysis" available)
- "text-generation"
- "token-classification" (alias "ner" available)
- "translation"
- "translation_xx_to_yy"
- "zero-shot-classification"

### Bug Fixes
- Model Development: Fixed a bug when using simple imputer with numpy >= 1.25.
- Model Development: Fixed a bug when inferring the type of label columns.

### Behavior Changes
- Model Registry: `log_model()` now returns a `ModelReference` object instead of a model ID.
- Model Registry: When deploying a model with only one target method, the `target_method` argument can be omitted.
- Model Registry: When using snowflake-ml-python with a version newer than what is available in the Snowflake Anaconda Channel, the `embed_local_ml_library` option is set to `True` automatically if not specified.
- Model Registry: When deploying a model to Snowpark Container Services with GPU, the default value of `num_workers` is now 1.
- Model Registry: `keep_order` and `output_with_input_features` have been removed from the deploy options. The behavior is now controlled by the type of the input passed to `model.predict()`: a `pandas.DataFrame` input behaves like the previous `keep_order=True` and `output_with_input_features=False`, while a `snowpark.DataFrame` input behaves like the previous `keep_order=False` and `output_with_input_features=True` (see the sketch below).
- Model Registry: When logging and deploying PyTorch (`torch.nn.Module` and `torch.jit.ScriptModule`) and TensorFlow (`tensorflow.Module` and `tensorflow.keras.Model`) models, we no longer accept models whose input and output are lists of tensors. Instead, we now accept models that take one or more tensors as positional arguments and return a tensor or a tuple of tensors. The input and output DataFrames used for prediction remain the same as before: every column is an array feature containing a tensor.
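A minimal sketch of the resulting flow (deployment and prediction parameter names are assumptions for illustration; `registry`, `sklearn_model`, and the test DataFrames are assumed to exist):

```python
# Hypothetical sketch: log_model() now returns a ModelReference, and predict()
# behavior follows the type of the input DataFrame.
model_ref = registry.log_model(model_name="my_model", model_version="2", model=sklearn_model)

# target_method can be omitted because the model exposes a single target method.
model_ref.deploy(deployment_name="my_model_predict", permanent=False)

# pandas.DataFrame input: row order preserved, output columns only
# (the former keep_order=True, output_with_input_features=False behavior).
pandas_predictions = model_ref.predict("my_model_predict", pandas_test_df)

# snowpark.DataFrame input: input features returned alongside the outputs,
# row order not guaranteed (the former keep_order=False, output_with_input_features=True behavior).
snowpark_predictions = model_ref.predict("my_model_predict", snowpark_test_df)
```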

## 1.0.5 (2023-08-17)

### New Features

@@ -13,7 +48,7 @@
- Model Registry: Fixed an issue where the UDF name created when deploying a model was not identical to the provided name and could not be correctly dropped when the deployment was dropped.
- connection_params.SnowflakeLoginOptions(): Added support for `private_key_path`.

-## 1.0.4
+## 1.0.4 (2023-07-28)

### New Features

2 changes: 1 addition & 1 deletion bazel/environments/conda-env-build.yml
@@ -14,5 +14,5 @@ dependencies:
- numpy==1.24.3
- packaging==23.0
- pyyaml==6.0
-  - scikit-learn==1.2.2
+  - scikit-learn==1.3.0
- xgboost==1.7.3
5 changes: 4 additions & 1 deletion bazel/environments/conda-env-snowflake.yml
@@ -9,6 +9,7 @@ dependencies:
- aiohttp==3.8.3
- anyio==3.5.0
- boto3==1.24.28
- cachetools==4.2.2
- cloudpickle==2.0.0
- conda-libmamba-solver==23.3.0
- coverage==6.3.2
@@ -23,6 +24,7 @@ dependencies:
- lightgbm==3.3.5
- mlflow==2.3.1
- moto==4.0.11
- multipledispatch==0.6.0
- mypy==0.981
- networkx==2.8.4
- numpy==1.24.3
@@ -36,13 +38,14 @@ dependencies:
- requests==2.29.0
- ruamel.yaml==0.17.21
- s3fs==2022.11.0
-  - scikit-learn==1.2.2
+  - scikit-learn==1.3.0
- scipy==1.9.3
- snowflake-connector-python==3.0.3
- snowflake-snowpark-python==1.5.1
- sqlparse==0.4.3
- tensorflow==2.10.0
- transformers==4.29.2
- types-protobuf==4.23.0.1
- types-requests==2.30.0.0
- typing-extensions==4.5.0
- xgboost==1.7.3
6 changes: 5 additions & 1 deletion bazel/environments/conda-env.yml
@@ -9,9 +9,11 @@ dependencies:
- aiohttp==3.8.3
- anyio==3.5.0
- boto3==1.24.28
- cachetools==4.2.2
- cloudpickle==2.0.0
- conda-forge::starlette==0.27.0
- conda-forge::types-PyYAML==6.0.12
- conda-forge::types-cachetools==4.2.2
- conda-libmamba-solver==23.3.0
- coverage==6.3.2
- cryptography==39.0.1
@@ -25,6 +27,7 @@ dependencies:
- lightgbm==3.3.5
- mlflow==2.3.1
- moto==4.0.11
- multipledispatch==0.6.0
- mypy==0.981
- networkx==2.8.4
- numpy==1.24.3
@@ -39,13 +42,14 @@ dependencies:
- requests==2.29.0
- ruamel.yaml==0.17.21
- s3fs==2022.11.0
-  - scikit-learn==1.2.2
+  - scikit-learn==1.3.0
- scipy==1.9.3
- snowflake-connector-python==3.0.3
- snowflake-snowpark-python==1.5.1
- sqlparse==0.4.3
- tensorflow==2.10.0
- transformers==4.29.2
- types-protobuf==4.23.0.1
- types-requests==2.30.0.0
- typing-extensions==4.5.0
- xgboost==1.7.3
7 changes: 4 additions & 3 deletions ci/conda_recipe/meta.yaml
@@ -17,7 +17,7 @@ build:
noarch: python
package:
name: snowflake-ml-python
-  version: 1.0.5
+  version: 1.0.6
requirements:
build:
- python
@@ -34,7 +34,7 @@ requirements:
- python
- pyyaml>=6.0,<7
- requests
-    - scikit-learn>=1.2.1,<1.3
+    - scikit-learn>=1.2.1,<1.4
- scipy>=1.9,<2
- snowflake-connector-python>=3.0.3,<4
- snowflake-snowpark-python>=1.5.1,<2
@@ -43,8 +43,9 @@ requirements:
- xgboost>=1.7.3,<2
run_constrained:
- lightgbm==3.3.5
-    - mlflow>=2.1.0,<3
+    - mlflow>=2.1.0,<2.4
- tensorflow>=2.9,<3
- torchdata>=0.4,<1
- transformers>=4.29.2,<5
source:
path: ../../
26 changes: 19 additions & 7 deletions codegen/sklearn_wrapper_template.py_template
@@ -25,6 +25,10 @@ from snowflake.snowpark import DataFrame, Session
from snowflake.snowpark.functions import pandas_udf, sproc
from snowflake.snowpark.types import PandasSeries
from snowflake.snowpark._internal.type_utils import convert_sp_to_sf_type
from snowflake.snowpark._internal.utils import (
TempObjectType,
random_name_for_temp_object,
)

from snowflake.ml.model.model_signature import (
DataType,
@@ -244,7 +248,7 @@ class {transform.original_class_name}(BaseTransformer):
cp.dump(self._sklearn_object, local_transform_file)

# Create temp stage to run fit.
-transform_stage_name = "SNOWML_TRANSFORM_{{safe_id}}".format(safe_id=self._get_rand_id())
+transform_stage_name = random_name_for_temp_object(TempObjectType.STAGE)
stage_creation_query = f"CREATE OR REPLACE TEMPORARY STAGE {{transform_stage_name}};"
SqlResultValidator(
session=session,
@@ -258,7 +262,7 @@ class {transform.original_class_name}(BaseTransformer):
stage_result_file_name = posixpath.join(transform_stage_name, os.path.basename(local_transform_file_name))
local_result_file_name = get_temp_file_path()

-fit_sproc_name = "SNOWML_FIT_{{safe_id}}".format(safe_id=self._get_rand_id())
+fit_sproc_name = random_name_for_temp_object(TempObjectType.PROCEDURE)
statement_params = telemetry.get_function_usage_statement_params(
project=_PROJECT,
subproject=_SUBPROJECT,
@@ -439,8 +443,7 @@ class {transform.original_class_name}(BaseTransformer):
pkg_versions=self._get_dependencies(), session=session, subproject=_SUBPROJECT)

# Register vectorized UDF for batch inference
-batch_inference_udf_name = "SNOWML_BATCH_INFERENCE_{{safe_id}}_{{method}}".format(
-    safe_id=self._get_rand_id(), method=inference_method)
+batch_inference_udf_name = random_name_for_temp_object(TempObjectType.FUNCTION)

# Need to do this since if we use self._sklearn_object directly in the UDF, Snowpark
# will try to pickle all of self which fails.
@@ -701,8 +704,17 @@ class {transform.original_class_name}(BaseTransformer):
expected_type_inferred = "{transform.udf_datatype}"
# when it is classifier, infer the datatype from label columns
if expected_type_inferred == "" and 'predict' in self.model_signatures:
# Batch inference takes a single expected output column type. Use the first columns type for now.
# TODO: Handle varying output column types.
label_cols_signatures = [row for row in self.model_signatures['predict'].outputs if row.name in self.output_cols]
if len(label_cols_signatures) == 0:
error_str = f"Output columns {{self.output_cols}} do not match model signatures {{self.model_signatures['predict'].outputs}}."
raise exceptions.SnowflakeMLException(
error_code=error_codes.INVALID_ATTRIBUTE,
original_exception=ValueError(error_str),
)
expected_type_inferred = convert_sp_to_sf_type(
-    self.model_signatures['predict'].outputs[0].as_snowpark_type()
+    label_cols_signatures[0].as_snowpark_type()
)

output_df = self._batch_inference(
@@ -955,7 +967,7 @@ class {transform.original_class_name}(BaseTransformer):
cp.dump(self._sklearn_object, local_score_file)

# Create temp stage to run score.
-score_stage_name = "SNOWML_SCORE_{{safe_id}}".format(safe_id=self._get_rand_id())
+score_stage_name = random_name_for_temp_object(TempObjectType.STAGE)
session = dataset._session
assert session is not None # keep mypy happy
stage_creation_query = f"CREATE OR REPLACE TEMPORARY STAGE {{score_stage_name}};"
@@ -968,7 +980,7 @@

# Use posixpath to construct stage paths
stage_score_file_name = posixpath.join(score_stage_name, os.path.basename(local_score_file_name))
-score_sproc_name = "SNOWML_SCORE_{{safe_id}}".format(safe_id=self._get_rand_id())
+score_sproc_name = random_name_for_temp_object(TempObjectType.PROCEDURE)
statement_params = telemetry.get_function_usage_statement_params(
project=_PROJECT,
subproject=_SUBPROJECT,
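The template changes above replace the hand-built "SNOWML_*_{{safe_id}}" names with Snowpark's internal temp-object naming helper. A small sketch of that pattern (the exact format of the generated names is an assumption):

```python
# Sketch: generate collision-safe names for temporary Snowflake objects with
# Snowpark's internal helper instead of hand-rolled format strings.
from snowflake.snowpark._internal.utils import TempObjectType, random_name_for_temp_object

stage_name = random_name_for_temp_object(TempObjectType.STAGE)      # e.g. "SNOWPARK_TEMP_STAGE_<random>" (format assumed)
sproc_name = random_name_for_temp_object(TempObjectType.PROCEDURE)
udf_name = random_name_for_temp_object(TempObjectType.FUNCTION)

# These names feed the generated "CREATE OR REPLACE TEMPORARY STAGE ..." statement
# and the sproc/UDF registrations shown in the diff above.
```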
21 changes: 18 additions & 3 deletions requirements.yml
@@ -68,6 +68,7 @@
version_requirements: ">=0.15,<2"
tags:
- build_essential
- deployment_core
# For fsspec[http] in conda
- name_conda: aiohttp
dev_version_conda: "3.8.3"
@@ -123,7 +124,7 @@
- build_essential
- name: mlflow
dev_version: "2.3.1"
version_requirements: ">=2.1.0,<3"
version_requirements: ">=2.1.0,<2.4"
requirements_extra_tags:
- mlflow
- name: moto
@@ -176,8 +177,8 @@
- name: s3fs
dev_version: "2022.11.0"
- name: scikit-learn
dev_version: "1.2.2"
version_requirements: ">=1.2.1,<1.3"
dev_version: "1.3.0"
version_requirements: ">=1.2.1,<1.4"
tags:
- build_essential
- name: scipy
@@ -211,6 +212,11 @@
- torch
- name: transformers
dev_version: "4.29.2"
version_requirements: ">=4.29.2,<5"
requirements_extra_tags:
- transformers
- name: types-requests
dev_version: "2.30.0.0"
- name: types-protobuf
dev_version: "4.23.0.1"
- name: types-PyYAML
@@ -226,3 +232,12 @@
version_requirements: ">=1.7.3,<2"
tags:
- build_essential
- name: types-cachetools
dev_version: "4.2.2"
from_channel: conda-forge
- name: cachetools
dev_version: "4.2.2"
# TODO: this will be a user side dep requirement
# enable when we are releasing FS.
- name: multipledispatch
dev_version: "0.6.0"
