Project import generated by Copybara. (#92)
GitOrigin-RevId: cd1cf14167a03d4d572a86fb6162ba2d9d9e8457

Co-authored-by: Snowflake Authors <[email protected]>
sfc-gh-sdas and Snowflake Authors authored Mar 12, 2024
1 parent de45707 commit 27431b2
Showing 108 changed files with 5,671 additions and 2,356 deletions.
55 changes: 37 additions & 18 deletions CHANGELOG.md
@@ -1,6 +1,33 @@
# Release History

-## 1.2.3
+## 1.3.0

+### Bug Fixes

+- Registry: Fix a bug where modules specified in `code_paths` could not be correctly imported when calling `log_model`.
+- Registry: Fix an incorrect error message when validating an input Snowpark DataFrame with an array feature.
+- Model Registry: Fix an issue where some files did not have proper permissions when deploying a model to SPCS.
+- Model Development: Relax package versions for all inference methods if the installed version
+  is not available in the Snowflake conda channel.

+### Behavior Changes

+- Registry: When running a method of a model, the value-range-based input validation that prevents inputs from
+  overflowing is now optional rather than enforced; this should improve performance and should not cause problems
+  for most kinds of models. To enable this check as before, specify `strict_input_validation=True` when calling
+  `run` (see the sketch after this list).
+- Registry: `relax_version=True` is now the default when logging a model, instead of pinning the exact local
+  dependency versions. This improves dependency versioning by using versions available in Snowflake. To switch back
+  to the previous behavior and use exact local dependency versions, specify `relax_version=False` when calling
+  `log_model`.
+- Model Development: The behavior of `fit_predict` has changed for all estimators.
+  First, it is now exposed on every estimator that implements this method;
+  second, the output is either a pandas DataFrame or a Snowpark DataFrame (a union of the two types).
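A minimal sketch of both switches above, assuming an existing Snowpark `session`, a Snowpark DataFrame `input_df`, and a fitted model `my_model` (all names illustrative):

```python
from snowflake.ml.registry import Registry

reg = Registry(session=session)  # `session`: an existing Snowpark session

# Re-enable strict value-range input validation (now optional by default).
mv = reg.get_model("MY_MODEL").version("V1")
output_df = mv.run(input_df, strict_input_validation=True)

# Pin exact local dependency versions (relax_version now defaults to True).
reg.log_model(
    my_model,
    model_name="MY_MODEL",
    version_name="V2",
    options={"relax_version": False},
)
```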

+### New Features

+- FileSet: `snowflake.ml.fileset.sfcfs.SFFileSystem` can now be serialized with `pickle`.
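A minimal sketch of the new `pickle` support, assuming an existing Snowflake connection `conn`; the stage path is hypothetical:

```python
import pickle

from snowflake.ml.fileset import sfcfs

sffs = sfcfs.SFFileSystem(sf_connection=conn)
# The file system object now survives a pickle round trip.
restored = pickle.loads(pickle.dumps(sffs))
print(restored.ls("@MY_DB.MY_SCHEMA.MY_STAGE"))  # hypothetical stage path
```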

+## 1.2.3 (2024-02-26)

### Bug Fixes

@@ -23,11 +50,7 @@
GridSearchCV, RandomizedSearchCV, PCA, IsolationForest, ...
- Registry: Support deleting a version of a model.

-## 1.2.2

-### Bug Fixes

-### Behavior Changes
+## 1.2.2 (2024-02-13)

### New Features

@@ -38,23 +61,21 @@
`snowflake.ml.model.models.huggingface_pipeline.HuggingFacePipelineModel` object, the following endpoints are required
to be allowed: huggingface.com:80, huggingface.com:443, huggingface.co:80, huggingface.co:443.

-## 1.2.1
+## 1.2.1 (2024-01-25)

### New Features

- Model Development: Infers output column data type for transformers when possible.
- Registry: `relax_version` option is available in the `options` argument when logging the model.

-## 1.2.0
+## 1.2.0 (2024-01-11)

### Bug Fixes

- Model Registry: Fix "XGBoost version not compiled with GPU support" error when running CPU inference against open-source
XGBoost models deployed to SPCS.
- Model Registry: Fix model deployment to SPCS on Windows machines.

-### Behavior Changes

### New Features

- Model Development: Introduced XGBoost external memory training feature. This feature enables training XGBoost models
@@ -72,7 +93,7 @@
`snowflake.ml.registry.Registry`, except when specifically required. The old model registry will be removed once all
its primary functionalities are fully integrated into the new registry.

-## 1.1.2
+## 1.1.2 (2023-12-18)

### Bug Fixes

@@ -90,7 +111,7 @@ its primary functionalities are fully integrated into the new registry.

- Model Development: SQL implementation of binary `precision_score` metric.
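A sketch of how this metric is invoked on a Snowpark DataFrame `df` (column names are illustrative, and the keyword names follow the `snowflake.ml.modeling.metrics` convention as understood here; check them against the installed version):

```python
from snowflake.ml.modeling.metrics import precision_score

# Precision is computed in SQL, without pulling `df` into local memory.
score = precision_score(
    df=df,  # an existing Snowpark DataFrame
    y_true_col_names="LABEL",
    y_pred_col_names="PREDICTION",
)
```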

-## 1.1.1
+## 1.1.1 (2023-12-05)

### Bug Fixes

@@ -103,24 +124,22 @@ its primary functionalities are fully integrated into the new registry.
requiring automatic input_cols inference, but need to avoid using specific
columns, like index columns, during training or inference.

-## 1.1.0
+## 1.1.0 (2023-12-01)

### Bug Fixes

- Model Registry: Fix pandas DataFrame input not handling the first row properly.
- Model Development: OrdinalEncoder and LabelEncoder output_columns do not need to be valid Snowflake identifiers. They
  would previously be excluded if the normalized name did not match the name specified in output_columns.

-### Behavior Changes

### New Features

- Model Registry: Add support for invoking a public endpoint on the SPCS service, by providing an "enable_ingress" SPCS
  deployment option.
- Model Development: Add support for distributed HPO - GridSearchCV and RandomizedSearchCV execution will be
  distributed on multi-node warehouses (see the sketch after this list).
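A sketch of the distributed HPO path, assuming a Snowpark DataFrame `train_df` and illustrative column names:

```python
from snowflake.ml.modeling.model_selection import GridSearchCV
from snowflake.ml.modeling.xgboost import XGBClassifier

# On a multi-node warehouse, the candidate fits are distributed across nodes.
search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid={"max_depth": [3, 5], "n_estimators": [50, 100]},
    input_cols=["FEATURE_1", "FEATURE_2"],
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
search.fit(train_df)
```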

-## 1.0.12
+## 1.0.12 (2023-11-13)

### Bug Fixes

@@ -145,7 +164,7 @@ its primary functionalities are fully integrated into the new registry.

- Model Registry: Enable best-effort SPCS job/service log streaming when logging level is set to INFO.

-## 1.0.11
+## 1.0.11 (2023-10-27)

### New Features

@@ -164,7 +183,7 @@ its primary functionalities are fully integrated into the new registry.
- Model Development: Fix metrics compatibility with Snowpark DataFrames that use Snowflake identifiers.
- Model Registry: Resolve 'delete_deployment' not deleting the SPCS service in certain cases.

-## 1.0.10
+## 1.0.10 (2023-10-13)

### Behavior Changes

4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -46,13 +46,13 @@ Note: You may need to configure your editor to run this on save.
To build the package, run:

```shell
-> bazel build //snowflake/ml:wheel
+> bazel build //:wheel
```

`bazel` can be run from anywhere under the monorepo, and it accepts either an absolute or a relative path. For example,

```sh
-snowflake/ml> bazel build :wheel
+snowml> bazel build :wheel
```

You can build an entire sub-tree as:
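A sketch using Bazel's standard `...` wildcard, which builds every target under a package (the package path here is illustrative):

```sh
snowml> bazel build //snowflake/ml/modeling/...
```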
4 changes: 2 additions & 2 deletions ci/conda_recipe/meta.yaml
@@ -17,7 +17,7 @@ build:
noarch: python
package:
name: snowflake-ml-python
-version: 1.2.3
+version: 1.3.0
requirements:
build:
- python
@@ -42,7 +42,7 @@ requirements:
- scikit-learn>=1.2.1,<1.4
- scipy>=1.9,<2
- snowflake-connector-python>=3.0.4,<4
-- snowflake-snowpark-python>=1.8.0,<2
+- snowflake-snowpark-python>=1.8.0,<2,!=1.12.0
- sqlparse>=0.4,<1
- typing-extensions>=4.1.0,<5
- xgboost>=1.7.3,<2
2 changes: 2 additions & 0 deletions codegen/codegen_rules.bzl
@@ -94,6 +94,8 @@ def autogen_estimators(module, estimator_info_list):
"//snowflake/ml/modeling/_internal:estimator_utils",
"//snowflake/ml/modeling/_internal:model_trainer",
"//snowflake/ml/modeling/_internal:model_trainer_builder",
"//snowflake/ml/modeling/_internal:transformer_protocols",
"//snowflake/ml/modeling/_internal:model_transformer_builder",
],
)

54 changes: 32 additions & 22 deletions codegen/sklearn_wrapper_generator.py
@@ -154,18 +154,6 @@ def _is_classifier_obj(class_object: Tuple[str, type]) -> bool:
"""
return WrapperGeneratorFactory._is_class_of_type(class_object[1], "ClassifierMixin")

-@staticmethod
-def _is_cluster_obj(class_object: Tuple[str, type]) -> bool:
-"""Check if the given estimator object can cluster features and conduct fit_predict methods.
-Args:
-class_object: Meta class object which needs to be checked.
-Returns:
-True if the class inherits from ClusterMixin, otherwise False.
-"""
-return WrapperGeneratorFactory._is_class_of_type(class_object[1], "ClusterMixin")

@staticmethod
def _is_meta_estimator_obj(class_object: Tuple[str, type]) -> bool:
"""Check if the given estimator object requires an `estimator` parameter.
@@ -277,6 +265,33 @@ def _is_xgboost(module_name: str) -> bool:
"""
return module_name.split(".")[0] == "xgboost"

+@staticmethod
+def _is_deterministic(class_object: Tuple[str, type]) -> bool:
+"""Checks if the given estimator class is deterministic.
+Args:
+class_object: Meta class object which needs to be checked.
+Returns:
+True if the class is deterministic, otherwise False.
+"""
+return not (
+WrapperGeneratorFactory._is_class_of_type(class_object[1], "LinearDiscriminantAnalysis")
+or WrapperGeneratorFactory._is_class_of_type(class_object[1], "BernoulliRBM")
+)

+@staticmethod
+def _is_deterministic_cross_platform(class_object: Tuple[str, type]) -> bool:
+"""Checks if the given estimator class is deterministic across different platforms.
+Args:
+class_object: Meta class object which needs to be checked.
+Returns:
+True if the class is deterministic across different platforms, otherwise False.
+"""
+return not (WrapperGeneratorFactory._is_class_of_type(class_object[1], "Isomap"))

@staticmethod
def _is_lightgbm(module_name: str) -> bool:
"""Checks if the given module belongs to LightGBM package.
@@ -604,7 +619,6 @@ def __init__(self, module_name: str, class_object: Tuple[str, type]) -> None:
self.test_estimator_imports_list: List[str] = []

# Optional function support
-self.fit_predict_cluster_function_support = False
self.fit_transform_manifold_function_support = False

# Dependencies
@@ -654,7 +668,6 @@ def _populate_flags(self) -> None:
self._is_multioutput_estimator = WrapperGeneratorFactory._is_multioutput_estimator_obj(self.class_object)
self._is_k_neighbors = WrapperGeneratorFactory._is_k_neighbors_obj(self.class_object)
self._is_heterogeneous_ensemble = WrapperGeneratorFactory._is_heterogeneous_ensemble_obj(self.class_object)
-self._is_cluster = WrapperGeneratorFactory._is_cluster_obj(self.class_object)
self._is_stacking_ensemble = WrapperGeneratorFactory._is_stacking_ensemble_obj(self.class_object)
self._is_voting_ensemble = WrapperGeneratorFactory._is_voting_ensemble_obj(self.class_object)
self._is_chain_multioutput = WrapperGeneratorFactory._is_chain_multioutput_obj(self.class_object)
@@ -668,6 +681,10 @@ def _populate_flags(self) -> None:
self._is_randomized_search_cv = WrapperGeneratorFactory._is_randomized_search_cv(self.class_object)
self._is_iterative_imputer = WrapperGeneratorFactory._is_iterative_imputer(self.class_object)
self._is_xgboost = WrapperGeneratorFactory._is_xgboost(self.module_name)
+self._is_deterministic = WrapperGeneratorFactory._is_deterministic(self.class_object)
+self._is_deterministic_cross_platform = WrapperGeneratorFactory._is_deterministic_cross_platform(
+self.class_object
+)

def _populate_import_statements(self) -> None:
self.estimator_imports_list.append("import numpy")
@@ -984,11 +1001,6 @@ def generate(self) -> "SklearnWrapperGenerator":
]
self.test_estimator_input_args_list.append(f"dictionary={dictionary}")

-if self._is_cluster:
-self.fit_predict_cluster_function_support = True
-if self._is_manifold:
-self.fit_transform_manifold_function_support = True

if self._is_manifold:
self.fit_transform_manifold_function_support = True

@@ -998,12 +1010,10 @@

if "n_components" in self.original_init_signature.parameters.keys():
if WrapperGeneratorFactory._is_class_of_type(self.class_object[1], "SpectralBiclustering"):
-# For spectral bi clustering, set number of sigular vertors to consider to number of input cols and
+# For spectral bi clustering, set number of singular vectors to consider to number of input cols and
# num best vector to select to half the number of input cols.
self.test_estimator_input_args_list.append("n_components=len(cols)")
self.test_estimator_input_args_list.append("n_best=int(len(cols)/2)")
-else:
-self.test_estimator_input_args_list.append("n_components=1")

if self._is_heterogeneous_ensemble:
if self._is_regressor: