
Commit b8c0af8

Merge branch 'databrickslabs:main' into feat/geo
2 parents 2c09e4b + c792bbe commit b8c0af8

33 files changed: +639 −98 lines

.github/workflows/acceptance.yml

Lines changed: 8 additions & 4 deletions
@@ -41,7 +41,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Run unit tests and generate test coverage report
         run: make test
@@ -93,7 +94,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Run integration tests on serverless cluster
         uses: databrickslabs/sandbox/acceptance@acceptance/v0.4.4
@@ -125,7 +127,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Install Databricks CLI
         run: |
@@ -177,7 +180,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Install Databricks CLI
         run: |

.github/workflows/docs-release.yml

Lines changed: 2 additions & 1 deletion
@@ -28,7 +28,8 @@ jobs:
 
       - name: Install Hatch
         run: |
-          pip install hatch==1.9.4
+          # click 8.3+ introduced bug for hatch
+          pip install "hatch==1.13.0" "click<8.3"
 
       - uses: actions/setup-node@v4
         with:

.github/workflows/downstreams.yml

Lines changed: 2 additions & 1 deletion
@@ -43,7 +43,8 @@ jobs:
 
       - name: Install toolchain
         run: |
-          pip install hatch==1.9.4
+          # click 8.3+ introduced bug for hatch
+          pip install "hatch==1.13.0" "click<8.3"
 
       - name: Check downstream compatibility
         uses: databrickslabs/sandbox/downstreams@downstreams/v0.0.1

.github/workflows/nightly.yml

Lines changed: 10 additions & 7 deletions
@@ -32,7 +32,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Run unit tests and generate test coverage report
         run: make test
@@ -81,7 +82,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Run integration tests on serverless cluster
         uses: databrickslabs/sandbox/acceptance@acceptance/v0.4.4
@@ -97,7 +99,6 @@ jobs:
           DATABRICKS_SERVERLESS_COMPUTE_ID: ${{ env.DATABRICKS_SERVERLESS_COMPUTE_ID }}
 
   e2e:
-    if: github.event_name == 'pull_request' && !github.event.pull_request.draft && !github.event.pull_request.head.repo.fork
     environment: tool
     runs-on: larger
     steps:
@@ -114,7 +115,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Install Databricks CLI
         run: |
@@ -147,7 +149,6 @@ jobs:
           ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
 
   e2e_serverless:
-    if: github.event_name == 'pull_request' && !github.event.pull_request.draft && !github.event.pull_request.head.repo.fork
     environment: tool
     runs-on: larger
     env:
@@ -166,7 +167,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Install Databricks CLI
         run: |
@@ -219,7 +221,8 @@ jobs:
           python-version: '3.12'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Login to Azure for azure-cli authentication
         uses: azure/login@v2

.github/workflows/performance.yml

Lines changed: 2 additions & 1 deletion
@@ -43,7 +43,8 @@ jobs:
           cache-dependency-path: '**/pyproject.toml'
 
       - name: Install hatch
-        run: pip install hatch==1.9.4
+        # click 8.3+ introduced bug for hatch
+        run: pip install "hatch==1.13.0" "click<8.3"
 
       - name: Login to Azure for azure-cli authentication
         uses: azure/login@v2

.github/workflows/push.yml

Lines changed: 2 additions & 1 deletion
@@ -37,7 +37,8 @@ jobs:
 
       - name: Run unit tests
         run: |
-          pip install hatch==1.9.4
+          # click 8.3+ introduced bug for hatch
+          pip install "hatch==1.13.0" "click<8.3"
           make test
 
   fmt:

.github/workflows/release.yml

Lines changed: 2 additions & 1 deletion
@@ -27,7 +27,8 @@ jobs:
 
       - name: Build wheels
         run: |
-          pip install hatch==1.9.4
+          # click 8.3+ introduced bug for hatch
+          pip install "hatch==1.13.0" "click<8.3"
           hatch build
 
       - name: Github release

demos/dqx_demo_tool.py

Lines changed: 39 additions & 8 deletions
@@ -44,6 +44,14 @@
 # MAGIC   summary_stats_file: profile_summary_stats.yml
 # MAGIC   warehouse_id: your-warehouse-id
 # MAGIC ```
+# MAGIC
+# MAGIC If you install DQX using custom installation path you must update `custom_install_path` variable below. Installation using custom path is required when using [group assigned cluster](https://docs.databricks.com/aws/en/compute/group-access)!
+
+# COMMAND ----------
+
+# Updated the installation path if you install DQX in a custom folder!
+custom_install_path: str = ""
+dbutils.widgets.text("dqx_custom_installation_path", custom_install_path, "DQX Custom Installation Path")
 
 # COMMAND ----------
 
@@ -107,15 +115,22 @@
 import glob
 import os
 
-user_name = spark.sql("select current_user() as user").collect()[0]["user"]
-default_dqx_installation_path = f"/Workspace/Users/{user_name}/.dqx"
+if custom_install_path:
+    default_dqx_installation_path = custom_install_path
+    print(f"Using custom installation path: {custom_install_path}")
+else:
+    user_name = spark.sql("select current_user() as user").collect()[0]["user"]
+    default_dqx_installation_path = f"/Workspace/Users/{user_name}/.dqx"
+    print(f"Using default user's home installation path: {default_dqx_installation_path}")
+
 default_dqx_product_name = "dqx"
 
 dbutils.widgets.text("dqx_installation_path", default_dqx_installation_path, "DQX Installation Folder")
 dbutils.widgets.text("dqx_product_name", default_dqx_product_name, "DQX Product Name")
 
 dqx_wheel_files_path = f"{dbutils.widgets.get('dqx_installation_path')}/wheels/databricks_labs_dqx-*.whl"
 dqx_wheel_files = glob.glob(dqx_wheel_files_path)
+
 try:
     dqx_latest_wheel = max(dqx_wheel_files, key=os.path.getctime)
 except:
@@ -126,6 +141,10 @@
 
 # COMMAND ----------
 
+custom_install_path = dbutils.widgets.get('dqx_custom_installation_path') or None
+
+# COMMAND ----------
+
 # MAGIC %md
 # MAGIC ### Run profiler workflow to generate quality rule candidates
 # MAGIC
@@ -162,7 +181,9 @@
 dq_engine = DQEngine(ws)
 
 # load the run configuration
-run_config = RunConfigLoader(ws).load_run_config(run_config_name="default", product_name=dqx_product_name)
+run_config = RunConfigLoader(ws).load_run_config(
+    run_config_name="default", product_name=dqx_product_name, install_folder=custom_install_path
+)
 
 # read the input data, limit to 1000 rows for demo purpose
 input_df = read_input_data(spark, run_config.input_config).limit(1000)
@@ -180,7 +201,10 @@
 print(yaml.safe_dump(checks))
 
 # save generated checks to location specified in the default run configuration inside workspace installation folder
-dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(run_config_name="default", product_name=dqx_product_name))
+dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(
+        run_config_name="default", product_name=dqx_product_name, install_folder=custom_install_path
+    )
+)
 
 # or save checks in arbitrary workspace location
 #dq_engine.save_checks(checks, config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))
@@ -245,7 +269,10 @@
 dq_engine = DQEngine(WorkspaceClient())
 
 # save checks to location specified in the default run configuration inside workspace installation folder
-dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(run_config_name="default", product_name=dqx_product_name))
+dq_engine.save_checks(checks, config=InstallationChecksStorageConfig(
+        run_config_name="default", product_name=dqx_product_name, install_folder=custom_install_path
+    )
+)
 
 # or save checks in arbitrary workspace location
 #dq_engine.save_checks(checks, config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))
@@ -267,7 +294,9 @@
 dq_engine = DQEngine(WorkspaceClient())
 
 # load the run configuration
-run_config = RunConfigLoader(ws).load_run_config(run_config_name="default", assume_user=True, product_name=dqx_product_name)
+run_config = RunConfigLoader(ws).load_run_config(
+    run_config_name="default", assume_user=True, product_name=dqx_product_name, install_folder=custom_install_path
+)
 
 # read the data, limit to 1000 rows for demo purpose
 bronze_df = read_input_data(spark, run_config.input_config).limit(1000)
@@ -276,8 +305,10 @@
 bronze_transformed_df = bronze_df.filter("vendor_id in (1, 2)")
 
 # load checks from location defined in the run configuration
-
-checks = dq_engine.load_checks(config=InstallationChecksStorageConfig(assume_user=True, run_config_name="default", product_name=dqx_product_name))
+checks = dq_engine.load_checks(config=InstallationChecksStorageConfig(
+        assume_user=True, run_config_name="default", product_name=dqx_product_name, install_folder=custom_install_path
+    )
+)
 
 # or load checks from arbitrary workspace file
 #checks = dq_engine.load_checks(config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml"))
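
Taken together, the demo changes thread a single custom installation folder through both run-config loading and checks storage. A minimal standalone sketch of that pattern (module paths and the folder value are assumptions for illustration, not taken from this diff):

```python
# Sketch only: module paths below are assumed from the DQX package layout.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import InstallationChecksStorageConfig
from databricks.labs.dqx.config_loader import RunConfigLoader

ws = WorkspaceClient()
dq_engine = DQEngine(ws)

# Hypothetical custom folder; use None to fall back to the user's home installation (~/.dqx).
custom_install_path = "/Workspace/Shared/dqx"

# The same install_folder value is passed to the run-config loader and the checks storage config.
run_config = RunConfigLoader(ws).load_run_config(
    run_config_name="default", product_name="dqx", install_folder=custom_install_path
)
checks = dq_engine.load_checks(
    config=InstallationChecksStorageConfig(
        run_config_name="default", product_name="dqx", install_folder=custom_install_path
    )
)
```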

docs/dqx/docs/dev/contributing.mdx

Lines changed: 9 additions & 0 deletions
@@ -333,6 +333,15 @@ git push --force-with-lease origin HEAD
 
 If you encounter any package dependency errors after `git pull`, run `make clean`
 
+### Resolving Hatch JSON TypeError
+
+If you encounter an error like:
+```text
+TypeError: the JSON object must be str, bytes or bytearray, not Sentinel
+```
+
+you can resolve it by downgrading the Click package to a compatible version that works with hatch: `pip install "click<8.3"`
+
 ### Common fixes for `mypy` errors
 
 See https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html for more details
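
As a quick sanity check before downgrading, you can confirm which click version is installed (plain illustration, not part of the commit):

```python
# hatch is reported to break with click >= 8.3, so check the installed version first.
from importlib.metadata import version

click_version = version("click")
print(f"click {click_version} is installed; run pip install 'click<8.3' if it is 8.3 or newer")
```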

docs/dqx/docs/guide/quality_checks_storage.mdx

Lines changed: 16 additions & 5 deletions
@@ -12,11 +12,22 @@ import TabItem from '@theme/TabItem';
 DQX provides flexible methods to load and save quality checks (rules) defined as metadata (a list of dictionaries) from different storage backends, making it easier to manage, share, and reuse checks across workflows and environments.
 
 Saving and loading methods accept a storage backend configuration as input. The following backend configuration are currently supported:
-- `FileChecksStorageConfig`: local files (JSON/YAML), or workspace files if invoked from Databricks notebook or job
-- `WorkspaceFileChecksStorageConfig`: workspace files (JSON/YAML) using absolute paths
-- `VolumeFileChecksStorageConfig`: Unity Catalog volumes (JSON/YAML file)
-- `TableChecksStorageConfig`: Unity Catalog tables
-- `InstallationChecksStorageConfig`: installation-managed location from the run config, ignores location and infers it from `checks_location` in the run config
+* `FileChecksStorageConfig`: local files (JSON/YAML), or workspace files if invoked from Databricks notebook or job. Containing fields:
+  * `location`: absolute or relative file path in the local filesystem (JSON or YAML); also works with absolute or relative workspace file paths if invoked from Databricks notebook or job.
+* `WorkspaceFileChecksStorageConfig`: workspace files (JSON/YAML) using absolute paths. Containing fields:
+  * `location`: absolute workspace file path (JSON or YAML).
+* `TableChecksStorageConfig`: Unity Catalog tables. Containing fields:
+  * `location`: table fully qualified name.
+  * `run_config_name`: (optional) run configuration name to load (use "default" if not provided).
+  * `mode`: (optional) write mode for saving checks (`overwrite` or `append`, default is `overwrite`). The `overwrite` mode will only replace checks for the specific run config and not all checks in the table.
+* `VolumeFileChecksStorageConfig`: Unity Catalog volumes (JSON/YAML file). Containing fields:
+  * `location`: Unity Catalog Volume file path (JSON or YAML).
+* `InstallationChecksStorageConfig`: installation-managed location from the run config, ignores location and infers it from `checks_location` in the run config. Containing fields:
+  * `location` (optional): automatically set based on the `checks_location` field from the run configuration.
+  * `install_folder`: (optional) installation folder where DQX is installed, only required when custom installation folder is used.
+  * `run_config_name` (optional) - run configuration name to load (use "default" if not provided).
+  * `product_name`: (optional) name of the product (use "dqx" if not provided).
+  * `assume_user`: (optional) if True, assume user installation, otherwise global installation (skipped if `install_folder` is provided).
 
 You can find details on how to define checks [here](/docs/guide/quality_checks_definition).

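The fields documented above map one-to-one onto the constructor arguments of the config classes. A brief illustrative sketch of two of the backends (module paths and the table name are assumptions, not part of this diff):

```python
# Illustrative only: module paths are assumed from the DQX package layout.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import TableChecksStorageConfig, WorkspaceFileChecksStorageConfig

dq_engine = DQEngine(WorkspaceClient())

# Load checks from an absolute workspace file path (YAML or JSON).
checks = dq_engine.load_checks(
    config=WorkspaceFileChecksStorageConfig(location="/Shared/App1/checks.yml")
)

# Save the checks to a (hypothetical) Unity Catalog table; "overwrite" only replaces
# rows belonging to this run config, while "append" keeps all existing rows.
dq_engine.save_checks(
    checks,
    config=TableChecksStorageConfig(
        location="main.dqx.checks", run_config_name="default", mode="overwrite"
    ),
)
```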