
Conversation

Contributor

@SajidAlamQB SajidAlamQB commented Sep 11, 2025

Description

This PR introduces SparkDatasetV2, a cleaner alternative to SparkDataset.

The existing SparkDataset is frustrating to work with for several reasons, which are outlined in #135.

  • It invents its own filepath parsing (split_filepath instead of the standard get_protocol_and_path), leading to duplication and inconsistencies (e.g., s3a:// paths are handled differently from other S3 schemes).
  • The codebase has multiple pathways (fsspec for metadata ops like exists/glob, Spark for data I/O), which work for S3/DBFS but break for other filesystems (e.g., GCS isn't directly supported, but could be via fsspec).
  • The spark extra requires spark-base, hdfs-base, and s3fs-base, which installs ~300MB of pyspark even on Databricks, where it conflicts with databricks-connect.

Dependency Issues:

  • From pyproject.toml: spark-sparkdataset = ["kedro-datasets[spark-base,hdfs-base,s3fs-base]"] forces all three, but HDFS is rarely used nowadays. The Databricks datasets rely on SparkDataset's parsing utils, creating circular dependencies.
  • pyproject.toml lumps everything together (e.g., spark-base includes delta-base, even for non-Delta use), which results in over-installation (e.g., pyspark on Databricks clusters), import conflicts, and low adoption in cloud-native setups.

Testing and Bugs:

  • Tests are patchy (e.g., HDFS tests may fail due to mocking/setup issues; S3 uses moto, but not comprehensively).
  • Filepath rewriting (strip_dbfs_prefix) happens in load/save but is not applied consistently for versioning.
  • Overwrite protections fail for versioned datasets, and exists() relies on Spark reads instead of filesystem metadata.
  • It differs from other Kedro datasets (e.g., it uses Spark I/O directly rather than relying fully on fsspec).
  • HDFS uses the beta InsecureClient, and Databricks warnings are ignored in tests.
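To illustrate the exists() concern: an existence check through the filesystem layer needs only metadata, not a Spark read. A minimal stdlib sketch (pathlib stands in for fsspec here; the function name is illustrative, not code from this PR):

```python
import tempfile
from pathlib import Path


def exists_via_filesystem(path: str) -> bool:
    # Metadata-only check; no Spark session or data read required.
    # (The real dataset would go through fsspec; pathlib stands in here.)
    return Path(path).exists()


with tempfile.TemporaryDirectory() as d:
    p = Path(d, "data.parquet")
    assert not exists_via_filesystem(str(p))  # nothing saved yet
    p.write_bytes(b"stub")
    assert exists_via_filesystem(str(p))      # now it exists
```

The same check via a Spark read would require a live session and would raise on malformed files rather than simply returning False.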

Development notes

This PR introduces SparkDatasetV2 to:

Dependency Restructuring

  • Introduce spark-core with zero dependencies
  • Separate environment-specific installations (spark-local, spark-databricks, spark-emr)
  • Make filesystem dependencies optional (spark-s3, spark-gcs, spark-azure)
  • Remove forced PySpark installation for Databricks users
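A rough sketch of what the restructured extras could look like in pyproject.toml (the bundle names come from this PR; the exact contents and pins below are assumptions for illustration):

```toml
[project.optional-dependencies]
spark-core = []                                    # zero dependencies
spark-local = ["kedro-datasets[spark-core]", "pyspark>=3.4"]
spark-databricks = ["kedro-datasets[spark-core]"]  # no pyspark: the runtime provides it
spark-emr = ["kedro-datasets[spark-core]"]         # pyspark provided by the EMR cluster
spark-s3 = ["kedro-datasets[spark-core]", "s3fs"]
spark-gcs = ["kedro-datasets[spark-core]", "gcsfs"]
spark-azure = ["kedro-datasets[spark-core]", "adlfs"]
```

Users would then pick only the bundles matching their environment, e.g. pip install "kedro-datasets[spark-databricks,spark-s3]".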

Code Improvements for SparkDataset

  • Use TYPE_CHECKING for lazy PySpark imports
  • Leverage fsspec for filesystem operations, except DBFS, where we keep dbutils for performance
  • Add proper path translation between fsspec and Spark protocols (s3:// → s3a://)
  • Remove custom filepath parsing in favor of get_protocol_and_path
  • Remove HDFS custom client dependency (available via optional spark-hdfs with PyArrow if needed)
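Two of the improvements above can be sketched together: lazy PySpark imports via TYPE_CHECKING, and translating fsspec protocols into the schemes Spark expects. The mapping table and function name below are illustrative, not the PR's actual code:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # pyspark is imported only for static type checkers, never at runtime,
    # so importing this module costs nothing on clusters without pyspark.
    from pyspark.sql import DataFrame  # noqa: F401

# Hypothetical fsspec -> Spark scheme table; the PR's actual mapping may differ.
_FSSPEC_TO_SPARK = {"s3": "s3a", "gs": "gs", "abfs": "abfss"}


def to_spark_path(protocol: str, path: str) -> str:
    """Translate an fsspec (protocol, path) pair into the scheme Spark expects."""
    if protocol in ("", "file"):
        return path
    return f"{_FSSPEC_TO_SPARK.get(protocol, protocol)}://{path}"


print(to_spark_path("s3", "bucket/data.parquet"))  # s3a://bucket/data.parquet
```

Centralising the translation in one table avoids the scattered, ad-hoc "s3a://" handling that split_filepath required.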

Now:

  • Databricks users: No more PySpark conflicts with databricks-connect
  • Reduced installation size: ~310MB → 0MB for cloud environments
  • Clearer installation paths based on environment
  • Users relying on kedro-datasets[spark] will need to choose specific bundles
  • HDFS support is deprecated (still available via spark-hdfs)

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

@SajidAlamQB SajidAlamQB changed the title SparkDataset Rewrite chore(datasets): SparkDataset Rewrite Sep 22, 2025
@SajidAlamQB SajidAlamQB changed the title chore(datasets): SparkDataset Rewrite feat(datasets): SparkDataset Rewrite Sep 22, 2025
Signed-off-by: Sajid Alam <[email protected]>
Member

@deepyaman deepyaman left a comment


Digging in a bit, this feels less like a rewrite and more like a refactoring. Here are my initial thoughts:

  • I've added a comment re my concern about removing the _dbfs_glob logic. This needs to be validated carefully (perhaps Databricks improved the performance of regular glob?) so we don't reintroduce a performance issue. I remember debugging this on a client project, because IIRC (it's been years) performance degrades to the point of unusability with a large number of versions.
  • Will this provide the best experience with spark-connect and databricks-connect? (FWIW, databricks-connect is a bit annoying to look into since it's not open source.) Spark 3.4 introduced Spark Connect, and Spark 4 includes major refactors to really make it part of the core (e.g., pyspark.sql.classic is moved to the same level as pyspark.sql.connect, and they inherit from the same base DataFrame; that wasn't the case before). IMO Spark Connect looks like the future of Spark, and a SparkDataset refresh should work seamlessly with it. Spark Connect (and Databricks Connect) are also potentially great for users who struggle with the deployment experience (e.g., needing to get code onto Databricks from local). That said, the classic experience is still likely a very common way for teams working more from within Databricks to operate.
  • I like the fact that HDFS is supported through PyArrow now. If there's still concern that people may need the old, separate HDFS client (not sure there is? hdfs hasn't had a release in two years and doesn't support Python 3.13, for example), maybe that could be handled through some sort of fallback logic?
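The suggested fallback could look like this rough sketch (function name and fallback order are assumptions, not code from this PR or kedro-datasets):

```python
def pick_hdfs_backend():
    """Hypothetical fallback: prefer PyArrow's HDFS support, else the legacy client, else none."""
    try:
        from pyarrow.fs import HadoopFileSystem  # noqa: F401  # optional spark-hdfs extra
        return "pyarrow"
    except ImportError:
        pass
    try:
        from hdfs import InsecureClient  # noqa: F401  # legacy, unmaintained client
        return "hdfs"
    except ImportError:
        return None


backend = pick_hdfs_backend()
print(backend)  # "pyarrow", "hdfs", or None depending on what's installed
```

Degrading to None (and raising a helpful error only when HDFS is actually used) keeps the import cost at zero for the majority of users who never touch HDFS.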

@SajidAlamQB
Contributor Author

Thanks @deepyaman, you're right about the DBFS glob issue; good catch, we'll add that back in. Regarding refactor vs. rewrite: we chose V2 for safety, but I'm open to discussing whether we should refactor the original instead if you think that's better.

@deepyaman
Member

Regarding refactor vs. rewrite: we chose V2 for safety, but I'm open to discussing whether we should refactor the original instead if you think that's better.

Yeah, of course. I think we can get the V2 "ready", and then see if it's sufficiently different that it needs to be breaking/a separate dataset.

Signed-off-by: Sajid Alam <[email protected]>
@SajidAlamQB
Contributor Author

@noklam would also appreciate your thoughts on this.

@noklam noklam self-requested a review September 30, 2025 15:34
Contributor

@noklam noklam left a comment


Sorry, I don't have time to review this in detail, but I don't want to block it. A few quick questions off the top of my head:

  • When should I use the Databricks-specific datasets versus spark.SparkDataset on Databricks? I recall there are already some things that are only possible with the Databricks ones. If we are rewriting this, I think we should have a look at that.
  • dbfs is a bit annoying: Databricks has already deprecated it, and new clusters default to UC Volumes, but a lot of people still use dbfs on older clusters.
  • Is there a goal or are there additional things this rewrite improves? Or is it more like refactoring?

@SajidAlamQB
Contributor Author

  • When should I use the Databricks-specific datasets versus spark.SparkDataset on Databricks? I recall there are already some things that are only possible with the Databricks ones. If we are rewriting this, I think we should have a look at that.
  • dbfs is a bit annoying: Databricks has already deprecated it, and new clusters default to UC Volumes, but a lot of people still use dbfs on older clusters.
  • Is there a goal or are there additional things this rewrite improves? Or is it more like refactoring?

Hey @noklam thanks,

I think the Databricks datasets are more for TABLE operations, while SparkDataset is for FILE operations.

The new V2 handles both DBFS and UC Volumes properly: it still supports /dbfs/, dbfs:/, and /Volumes/ paths, and we apply the DBFS optimisations only when needed.

I think this goes a bit beyond a refactor: we're solving some long-standing issues. Databricks users can now actually use the dataset, we add Spark Connect support for Spark 4.0, and users can now choose their dependencies instead of installing everything, thanks to the pyproject.toml changes. It makes the dataset more usable.
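The path handling described above could be sketched roughly like this (the function name and exact rules are illustrative assumptions, not the PR's code):

```python
def normalize_databricks_path(filepath: str) -> str:
    """Hypothetical sketch: accept /dbfs/, dbfs:/ and /Volumes/ spellings uniformly."""
    if filepath.startswith("/dbfs/"):
        # FUSE-mount spelling -> dbfs:/ URI spelling
        return "dbfs:/" + filepath[len("/dbfs/"):]
    # dbfs:/ URIs and UC Volumes paths pass through unchanged;
    # DBFS-specific optimisations would key off the dbfs:/ prefix only.
    return filepath


print(normalize_databricks_path("/dbfs/FileStore/data.parquet"))  # dbfs:/FileStore/data.parquet
print(normalize_databricks_path("/Volumes/cat/schema/vol/data.parquet"))  # unchanged
```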

Signed-off-by: Sajid Alam <[email protected]>
@SajidAlamQB SajidAlamQB requested a review from merelcht October 16, 2025 11:56
@SajidAlamQB
Contributor Author

SajidAlamQB commented Oct 16, 2025

Hey team, this PR is ready for another review round. Since the last discussion:

  • Re-added DBFS glob optimisations for performance
  • Added comprehensive tests for path handling
  • Tested Spark Connect compatibility

Open questions:

  • Should we deprecate the original or maintain both versions for now?
  • Need validation on Databricks/EMR environments
  • Since this is feeling more like a refactor, should we keep it as V2 or replace the original?

Would appreciate reviews from @deepyaman @noklam @merelcht @ravi-kumar-pilla on the latest changes.

Also, I'm not familiar with all these platforms; could someone with access test this dataset on:

  • Databricks (both DBFS and UC Volumes)
  • AWS EMR
  • GCP

@SajidAlamQB SajidAlamQB marked this pull request as ready for review October 16, 2025 13:35
@ravi-kumar-pilla
Contributor

Also, I'm not familiar with all these platforms; could someone with access test this dataset on:

  • Databricks (both DBFS and UC Volumes)
  • AWS EMR
  • GCP

Hi @SajidAlamQB, I tested locally before and can test again. I will also test on Databricks. I am not familiar with AWS EMR or GCP, but I can try. Thank you!

@ravi-kumar-pilla
Contributor

Hi @SajidAlamQB ,

I have done some basic testing on Databricks. Below is the notebook I tried after a basic setup using UC Volumes:

import time

from kedro_datasets.spark import SparkDataset, SparkDatasetV2

test_catalog = "spark_v2_catalog"
test_schema = "default"
volume_name = "spark_test_volume"

uc_base_path = f"/Volumes/{test_catalog}/{test_schema}/{volume_name}"
temp_dir = f"{uc_base_path}/basic"
dbutils.fs.mkdirs(temp_dir)  # dbutils is provided by the Databricks runtime

# Small DataFrame to exercise save/load (spark is provided by the Databricks runtime)
test_df = spark.range(100).withColumnRenamed("id", "value")

# Test 1: Parquet format (fails with SparkDatasetV2 but passes with SparkDataset)
parquet_path = f"{temp_dir}/basic_test.parquet"
dataset_parquet = SparkDatasetV2(filepath=parquet_path, file_format="parquet")

start_time = time.time()
dataset_parquet.save(test_df)
save_time = time.time() - start_time
print(f"✅ Saved parquet in {save_time:.3f}s to: {parquet_path}")

start_time = time.time()
loaded_df = dataset_parquet.load()
load_time = time.time() - start_time
print(f"✅ Loaded parquet in {load_time:.3f}s - {loaded_df.count()} rows")

# Verify data integrity
assert test_df.count() == loaded_df.count(), "Row count mismatch"
print("✅ Data integrity verified")

# Test 2: CSV with custom args (fails with SparkDatasetV2 but passes with SparkDataset)
print("\n📁 Test 2: CSV with Custom Arguments")
csv_path = f"{temp_dir}/basic_test.csv"
dataset_csv = SparkDataset(
    filepath=csv_path,
    file_format="csv",
    save_args={"header": True, "sep": "|"},
    load_args={"header": True, "sep": "|", "inferSchema": True},
)

dataset_csv.save(test_df)
loaded_csv_df = dataset_csv.load()
print(f"✅ CSV save/load successful - {loaded_csv_df.count()} rows")
loaded_csv_df.show(3)

# Test 3: JSON format (fails with SparkDatasetV2 but passes with SparkDataset)
print("\n📁 Test 3: JSON Format")
json_path = f"{temp_dir}/basic_test.json"
dataset_json = SparkDataset(filepath=json_path, file_format="json")

# DatasetError: Failed while saving data to dataset kedro_datasets.spark.spark_dataset_v2.SparkDatasetV2(filepath='file:///Volumes/spark_v2_catalog/default/spark_test_volume/basic/basic_test.csv', file_format='csv', load_args={'header': True, 'sep': '|', 'inferSchema': True}, save_args={'header': True, 'sep': '|'}, protocol='file').
# (java.lang.SecurityException) Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden

Observations:

  1. For UC Volumes there is no issue with SparkDataset, but SparkDatasetV2 fails with the error local filesystem access is forbidden. Detailed error in the notebook.
  2. For DBFS, both SparkDataset and SparkDatasetV2 fail with the error Public DBFS root is disabled. Access is denied on path. I am not sure if I am doing something wrong. These are my paths: dbfs:/FileStore/spark_dataset_test/ and /dbfs/FileStore/spark_dataset_test/
  3. %pip install hdfs s3fs is required for SparkDataset, but SparkDatasetV2 does not require them (the dependency restructuring works great).
  4. Code looks good to me.

Thank you

Signed-off-by: Sajid Alam <[email protected]>
@SajidAlamQB
Contributor Author

  1. For UC Volumes there is no issue with SparkDataset, but SparkDatasetV2 fails with the error local filesystem access is forbidden. Detailed error in the notebook.

Hey @ravi-kumar-pilla, thanks for the detailed testing. I've pushed a fix for the UC Volumes issue; could you retest it? Thank you!

@ravi-kumar-pilla
Contributor

Hey @ravi-kumar-pilla, thanks for the detailed testing. I've pushed a fix for the UC Volumes issue; could you retest it? Thank you!

UC Volumes works now, though I am still receiving errors with DBFS. Let's connect tomorrow to discuss this. Thank you.

@ravi-kumar-pilla
Contributor

ravi-kumar-pilla commented Oct 27, 2025

Hi @SajidAlamQB, related tickets we should address in this PR if possible: #216, #1210
