Project import generated by Copybara. (#31)
snowflake-provisioner authored Jul 28, 2023
1 parent 091fb6c commit 9eec61f
Showing 150 changed files with 8,551 additions and 4,620 deletions.
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,17 @@
# Release History

## 1.0.4

### New Features
- Model Registry: Added support to save/load/deploy TensorFlow models (`tensorflow.Module`).
- Model Registry: Added support to save/load/deploy MLflow PyFunc models (`mlflow.pyfunc.PyFuncModel`).
- Model Development: Input dataframes can now be joined against data loaded from staged files.
- Model Development: Added support for non-English languages.

### Bug Fixes

- Model Registry: Fixed an issue where model dependencies were incorrectly reported as unresolvable on certain platforms.

## 1.0.3 (2023-07-14)

### Behavior Changes
8 changes: 7 additions & 1 deletion README.md
@@ -3,6 +3,7 @@
Snowpark ML is a set of tools including SDKs and underlying infrastructure to build and deploy machine learning models. With Snowpark ML, you can pre-process data, train, manage and deploy ML models all within Snowflake, using a single SDK, and benefit from Snowflake’s proven performance, scalability, stability and governance at every stage of the Machine Learning workflow.

## Key Components of Snowpark ML

The Snowpark ML Python SDK provides a number of APIs to support each stage of an end-to-end Machine Learning development and deployment process, and includes two key components.

### Snowpark ML Development [Public Preview]
@@ -16,6 +17,7 @@ A collection of Python APIs to enable efficient model development directly in Snowflake.
### Snowpark ML Ops [Private Preview]

Snowpark MLOps complements the Snowpark ML Development API and provides model management capabilities along with integrated deployment into Snowflake. Currently, the API consists of:

1. FileSet API: FileSet provides a Python fsspec-compliant API for materializing data into a Snowflake internal stage from a query or Snowpark DataFrame, along with a number of convenience APIs.

1. Model Registry: A Python API for managing models within Snowflake, which also supports deployment of ML models into Snowflake warehouses as vectorized UDFs.
@@ -25,15 +27,19 @@ During PrPr, we are iterating on the API without backward compatibility guarantees.
- [Documentation](https://docs.snowflake.com/developer-guide/snowpark-ml)

## Getting started

### Have your Snowflake account ready

If you don't have a Snowflake account yet, you can [sign up for a 30-day free trial account](https://signup.snowflake.com/).

### Create a Python virtual environment
Python 3.8 is required. You can use [miniconda](https://docs.conda.io/en/latest/miniconda.html), [anaconda](https://www.anaconda.com/), or [virtualenv](https://docs.python.org/3/tutorial/venv.html) to create a Python 3.8 virtual environment.

Python versions 3.8, 3.9, and 3.10 are supported. You can use [miniconda](https://docs.conda.io/en/latest/miniconda.html), [anaconda](https://www.anaconda.com/), or [virtualenv](https://docs.python.org/3/tutorial/venv.html) to create a virtual environment.

To have the best experience when using this library, [creating a local conda environment with the Snowflake channel](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-packages.html#local-development-and-testing) is recommended.
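A minimal sketch of that setup (the environment name is illustrative; the channel URL is the Snowflake conda channel described in the linked docs):

```
conda create --name snowpark-ml python=3.8
conda activate snowpark-ml

# Prefer packages from the Snowflake conda channel so that locally installed
# versions match what is available inside Snowflake.
conda config --env --add channels https://repo.anaconda.com/pkgs/snowflake
```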

### Install the library to the Python virtual environment

```
pip install snowflake-ml-python
```
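As a quick smoke test of the install (assuming the package exposes the `snowflake.ml` namespace used throughout this repo):

```
python -c "import snowflake.ml; print('snowflake-ml-python imported OK')"
```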
6 changes: 4 additions & 2 deletions bazel/get_affected_targets.sh
@@ -28,8 +28,10 @@ help() {
echo "Running ${PROG}"

bazel="bazel"
current_revision=$(git rev-parse HEAD)
pr_revision=${current_revision}
current_revision=$(git symbolic-ref --short -q HEAD \
    || git describe --tags --exact-match 2> /dev/null \
    || git rev-parse --short HEAD)
pr_revision=$(git rev-parse HEAD)
output_path="/tmp/affected_targets/targets"
workspace_path=$(pwd)

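For context, the three commands in the new fallback chain resolve a revision name in order of preference (example outputs are hypothetical):

```
git symbolic-ref --short -q HEAD               # on a branch: prints its name, e.g. "main"
git describe --tags --exact-match 2> /dev/null # detached exactly on a tag: prints it, e.g. "v1.0.4"
git rev-parse --short HEAD                     # otherwise: prints the abbreviated commit hash
```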
3 changes: 3 additions & 0 deletions bazel/mypy/CREDITS.md
@@ -0,0 +1,3 @@
Special thanks to [bazel-mypy-integration](https://github.com/bazel-contrib/bazel-mypy-integration).

This package has been forked from that repo and modified to cater to the specific needs of this Snowflake repo.
263 changes: 138 additions & 125 deletions bazel/mypy/mypy.bzl
@@ -1,54 +1,52 @@
"Public API"

load("@bazel_skylib//lib:shell.bzl", "shell")
load("@bazel_skylib//lib:sets.bzl", "sets")
load("//bazel/mypy:rules.bzl", "MyPyStubsInfo")

MyPyAspectInfo = provider(
"TODO: documentation",
fields = {
"out": "mypy output.",
"cache": "cache generated by mypy.",
"exe": "Used to pass the rule implementation built exe back to calling aspect.",
"out": "Used to pass the dummy output file back to calling aspect.",
},
)

# We don't support stubs (pyi) yet.
PY_EXTENSIONS = ["py"]
PY_RULES = ["py_binary", "py_library", "py_test", "py_wheel", "py_package"]
# Switch to True only during debugging and development.
# All releases should have this as False.
DEBUG = False

VALID_EXTENSIONS = ["py", "pyi"]

DEFAULT_ATTRS = {
"_mypy_sh": attr.label(
"_template": attr.label(
default = Label("//bazel/mypy:mypy.sh.tpl"),
allow_single_file = True,
),
"_mypy": attr.label(
"_mypy_cli": attr.label(
default = Label("//bazel/mypy:mypy"),
executable = True,
cfg = "host",
cfg = "exec",
),
"_mypy_config": attr.label(
default = Label("//:mypy.ini"),
allow_single_file = True,
),
"_debug": attr.bool(
default = False,
)
}

# See https://github.com/python/mypy/pull/4759 for what `cache_map_triples` mean.
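# Each triple is (source file, its mypy .meta.json cache file, its .data.json
# cache file), flattened into a single argument list, e.g. (hypothetical paths):
#   src/foo.py src/foo.py.meta.json src/foo.py.data.json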
def _sources_to_cache_map_triples(cache_files, dep_cache_files):
def _sources_to_cache_map_triples(srcs):
    triples_as_flat_list = []
    for d in (cache_files, dep_cache_files):
        for src, (meta, data) in d.items():
            triples_as_flat_list.extend([
                shell.quote(src.path),
                shell.quote(meta.path),
                shell.quote(data.path),
            ])
    for f in srcs:
        f_path = f.path
        triples_as_flat_list.extend([
            shell.quote(f_path),
            shell.quote("{}.meta.json".format(f_path)),
            shell.quote("{}.data.json".format(f_path)),
        ])
    return triples_as_flat_list

def _flatten_cache_dict(cache_files):
    result = []
    for meta, data in cache_files.values():
        result.append(meta)
        result.append(data)
    return result
def _is_external_dep(dep):
    return dep.label.workspace_root.startswith("external/")

def _is_external_src(src_file):
    return src_file.path.startswith("external/")
@@ -57,127 +55,142 @@ def _extract_srcs(srcs):
    direct_src_files = []
    for src in srcs:
        for f in src.files.to_list():
            if f.extension in PY_EXTENSIONS and not _is_external_src(f):
            if f.extension in VALID_EXTENSIONS:
                direct_src_files.append(f)
    return direct_src_files

# Overview
# This aspect does the following:
# - Create an action to run mypy against the sources of `target`
#   - input of this action:
#     - source files of `target` and source files of all its deps.
#     - cache files produced by checking its deps.
#   - output of this action:
#     - mypy stderr+stdout in a file
#     - cache files produced by checking the source files of `target`
#   - this action depends on actions created for the deps, so that it always
#     has access to cache files produced by those actions.
# - Propagate the output of this action along the `deps` edge of the build graph.
# - Produce an OutputGroup which contains the output of all the actions created
#   along the build graph, so that one can use the bazel command line to mark all
#   those actions as required and make them run.
def _mypy_aspect_impl(target, ctx):
    if (ctx.rule.kind not in PY_RULES or
        ctx.label.workspace_root.startswith("external")):
        return []
def _extract_transitive_deps(deps):
    transitive_deps = []
    for dep in deps:
        if MyPyStubsInfo not in dep and PyInfo in dep and not _is_external_dep(dep):
            transitive_deps.append(dep[PyInfo].transitive_sources)
    return transitive_deps

def _extract_stub_deps(deps):
    # Need to add the .py files AND the .pyi files that are
    # deps of the rule
    stub_files = []
    for dep in deps:
        if MyPyStubsInfo in dep:
            for stub_srcs_target in dep[MyPyStubsInfo].srcs:
                for src_f in stub_srcs_target.files.to_list():
                    if src_f.extension == "pyi":
                        stub_files.append(src_f)
    return stub_files

def _extract_imports(imports, label):
    # NOTE: Bazel's implementation of this for py_binary, py_test is at
    # src/main/java/com/google/devtools/build/lib/bazel/rules/python/BazelPythonSemantics.java
    mypypath_parts = []
    for import_ in imports:
        if import_.startswith("/"):
            # buildifier: disable=print
            print("ignoring invalid absolute path '{}'".format(import_))
        elif import_ in ["", "."]:
            mypypath_parts.append(label.package)
        else:
            mypypath_parts.append("{}/{}".format(label.package, import_))
    return mypypath_parts

def _mypy_rule_impl(ctx):
    base_rule = ctx.rule
    debug = ctx.attr._debug
    mypy_config_file = ctx.file._mypy_config

    # Get the cache files generated by running mypy against the deps.
    dep_cache_files = {}
    for dep in ctx.rule.attr.deps:
        if MyPyAspectInfo in dep:
            dep_cache_files.update(dep[MyPyAspectInfo].cache)
    mypy_config_file = ctx.file._mypy_config

    mypypath_parts = []
    direct_src_files = []
    transitive_srcs_depsets = []
    stub_files = []

    if hasattr(base_rule.attr, "srcs"):
        direct_src_files = _extract_srcs(base_rule.attr.srcs)

    # It's possible that this target does not have srcs (py_wheel, for example).
    # However, if the user requests to type check a py_wheel, we should make sure
    # its python transitive deps get checked.
    if direct_src_files:
        # There are source files in this target to check. The check will result in
        # cache files. Request bazel to allocate those files now.
        cache_files = {}
        for src in direct_src_files:
            meta_file = ctx.actions.declare_file("{}.meta.json".format(src.basename))
            data_file = ctx.actions.declare_file("{}.data.json".format(src.basename))
            cache_files[src] = (meta_file, data_file)

        # The mypy stdout, which is expected to be produced by mypy_script.
        mypy_out = ctx.actions.declare_file("%s_mypy_out" % ctx.rule.attr.name)
        # The script to invoke mypy against this target.
        mypy_script = ctx.actions.declare_file(
            "%s_mypy_script" % ctx.rule.attr.name,
        )

        # Generated files are located in a different root dir than source files.
        # Thus we need to let mypy know where to find both kinds, in case both
        # kinds are present in one analysis.
        src_root_paths = sets.to_list(
            sets.make(
                [f.root.path for f in dep_cache_files.keys()] +
                [f.root.path for f in cache_files.keys()]),
        )

        all_src_files = direct_src_files + list(dep_cache_files.keys())
    if hasattr(base_rule.attr, "deps"):
        transitive_srcs_depsets = _extract_transitive_deps(base_rule.attr.deps)
        stub_files = _extract_stub_deps(base_rule.attr.deps)

    if hasattr(base_rule.attr, "imports"):
        mypypath_parts = _extract_imports(base_rule.attr.imports, ctx.label)

    final_srcs_depset = depset(transitive = transitive_srcs_depsets +
                               [depset(direct = direct_src_files)])
    src_files = [f for f in final_srcs_depset.to_list() if not _is_external_src(f)]
    if not src_files:
        return None

    mypypath_parts += [src_f.dirname for src_f in stub_files]
    mypypath = ":".join(mypypath_parts)

    out = ctx.actions.declare_file("%s_dummy_out" % ctx.rule.attr.name)
    exe = ctx.actions.declare_file(
        "%s_mypy_exe" % ctx.rule.attr.name,
    )

    # Compose a list of the files needed for use. Note that aspect rules can use
    # the project version of mypy; however, other rules should fall back on their
    # relative runfiles.
    runfiles = ctx.runfiles(files = src_files + stub_files + [mypy_config_file])

    src_root_paths = sets.to_list(
        sets.make([f.root.path for f in src_files]),
    )

    ctx.actions.expand_template(
        template = ctx.file._template,
        output = exe,
        substitutions = {
            "{MYPY_BIN}": ctx.executable._mypy.path,
            "{CACHE_MAP_TRIPLES}": " ".join(_sources_to_cache_map_triples(cache_files, dep_cache_files)),
            "{MYPY_EXE}": ctx.executable._mypy_cli.path,
            "{MYPY_ROOT}": ctx.executable._mypy_cli.root.path,
            "{CACHE_MAP_TRIPLES}": " ".join(_sources_to_cache_map_triples(src_files)),
            "{PACKAGE_ROOTS}": " ".join([
                "--package-root " + shell.quote(path or ".")
                for path in src_root_paths
            ]),
            "{SRCS}": " ".join([
                shell.quote(f.path)
                for f in all_src_files
                for f in src_files
            ]),
            "{VERBOSE_OPT}": "--verbose" if debug else "",
            "{VERBOSE_BASH}": "set -x" if debug else "",
            "{OUTPUT}": mypy_out.path,
            "{ADDITIONAL_MYPYPATH}": ":".join([p for p in src_root_paths if p]),
            "{MYPY_INI}": mypy_config_file.path,
        }
    ctx.actions.expand_template(
        template = ctx.file._mypy_sh,
        output = mypy_script,
        substitutions = substitutions,
        is_executable = True,
    )

    # We want mypy to follow imports, so all the source files of the dependencies
    # are needed to check this target.
    ctx.actions.run(
        outputs = [mypy_out] + _flatten_cache_dict(cache_files),
        inputs = depset(
            all_src_files +
            [mypy_config_file] +
            _flatten_cache_dict(dep_cache_files)  # cache generated by analyzing deps
        ),
        tools = [ctx.executable._mypy],
        executable = mypy_script,
        mnemonic = "MyPy",
        progress_message = "Type-checking %s" % ctx.label,
        use_default_shell_env = True,
    )
    dep_cache_files.update(cache_files)
    transitive_mypy_outs = []
    for dep in ctx.rule.attr.deps:
        if OutputGroupInfo in dep:
            if hasattr(dep[OutputGroupInfo], "mypy"):
                transitive_mypy_outs.append(dep[OutputGroupInfo].mypy)
            "{VERBOSE_OPT}": "--verbose" if DEBUG else "",
            "{VERBOSE_BASH}": "set -x" if DEBUG else "",
            "{OUTPUT}": out.path if out else "",
            "{MYPYPATH_PATH}": mypypath if mypypath else "",
            "{MYPY_INI_PATH}": mypy_config_file.path,
        },
        is_executable = True,
    )

    return [
        DefaultInfo(executable = exe, runfiles = runfiles),
        MyPyAspectInfo(exe = exe, out = out),
    ]

def _mypy_aspect_impl(_, ctx):
    if (ctx.rule.kind not in ["py_binary", "py_library", "py_test", "mypy_test"] or
        ctx.label.workspace_root.startswith("external")):
        return []

    providers = _mypy_rule_impl(
        ctx
    )
    if not providers:
        return []

    info = providers[0]
    aspect_info = providers[1]

    ctx.actions.run(
        outputs = [aspect_info.out],
        inputs = info.default_runfiles.files,
        tools = [ctx.executable._mypy_cli],
        executable = aspect_info.exe,
        mnemonic = "MyPy",
        progress_message = "Type-checking %s" % ctx.label,
        use_default_shell_env = True,
    )
    return [
        OutputGroupInfo(
            # We may not need to run mypy against this target, but we request
            # all its dependencies to be checked, recursively, by demanding the
            # output of those checks.
            mypy = depset([mypy_out] if direct_src_files else [], transitive = transitive_mypy_outs),
            mypy = depset([aspect_info.out]),
        ),
        MyPyAspectInfo(out = mypy_out if direct_src_files else None, cache = dep_cache_files),
    ]

mypy_aspect = aspect(

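The reworked aspect publishes its results through the `mypy` output group, so it can be requested ad hoc from the Bazel command line. A hedged sketch (the target pattern is illustrative):

```
# Type-check a set of targets and everything they depend on by registering the
# aspect and demanding its "mypy" output group.
bazel build //snowflake/ml/... \
    --aspects=//bazel/mypy:mypy.bzl%mypy_aspect \
    --output_groups=mypy
```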