-
Notifications
You must be signed in to change notification settings - Fork 107
feat(datasets): make table write mode configurable #1093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(datasets): make table write mode configurable #1093
Conversation
…ble materialization - feat: Introduces a `mode` parameter for save operations, allowing "append", "overwrite", "error", and "ignore" options to control write behavior. - Supports legacy `overwrite` flag with backward compatibility and deprecation warning. - Adds mode dispatching logic to handle different write scenarios. - Updates examples and docs to reflect new parameter. - Prevents simultaneous use of both `mode` and legacy `overwrite` to avoid ambiguity. - docs: Improves documentation and usage examples for new save modes. Semi-breaking change: Replaces `overwrite` save argument with `mode`; users should update configurations to use `mode`. Signed-off-by: gitgud5000 <[email protected]>
2e9e539 to
a41bb19
Compare
Signed-off-by: gitgud5000 <[email protected]>
a41bb19 to
1f609c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Did a quick-pass review. I see you're still actively updating, so I'll revisit this later.
|
|
||
| self._materialized = self._save_args.pop("materialized") | ||
|
|
||
| # Handle mode / overwrite conflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wonder if insert + view needs to be disallowed; I haven't actually checked yet.
|
I've been activily working with this because I need the features now 😅. I've also added support for passing credentials to the |
ibis.TableDataset add configurable save mode for table materializationibis.TableDataset add configurable save mode for table materialization and credentials support
…precate connection - feat: introduce a new `credentials` parameter to accept connection info as a string or dict, preferred over the deprecated `connection` param - warn if both `credentials` and `connection` are provided, prioritizing `credentials` - support connection strings and dicts with connection strings for backend connections - feat: add backend name extraction to be used in `_describe()` method - docs: update docstrings to explain `credentials` and deprecation of `connection` !Semi- Breaking Change: deprecates the `connection` parameter in favor of `credentials` Signed-off-by: gitgud5000 <[email protected]>
…lacing the function dispatch dict with direct conditional handling, improving clarity and maintainability - docs: expands and clarifies docstring to explain available save modes and their effects, referencing Spark semantics for familiarity Signed-off-by: gitgud5000 <[email protected]>
Signed-off-by: gitgud5000 <[email protected]>
d545871 to
6445239
Compare
Best reason. 😁
Sorry, I didn't check in time; ideal is a separate PR, but it's probably no big deal to have it in the same. If it ends up being a blocker to get something merged, can always split it out later. |
- fix: add early return for empty DataFrame - fix: ensure table creation occurs before insert operations in append mode when table doesn't exist Signed-off-by: gitgud5000 <[email protected]>
40bbfe3 to
6130581
Compare
|
Could you please take a look at this pull request, @ankatiyar @ravi-kumar-pilla? |
|
@gitgud5000 I'll take a look, in the meanwhile could you add some unit tests to go along with the changes? :D |
- fix: remove lingering 'mode' key when mapping legacy overwrite to prevent unexpected writer kwargs - fix: treat empty pandas DataFrame as a no-op in save, supporting both pandas and ibis tables Prevents accidental parameter leakage and avoids errors when saving empty data. Signed-off-by: gitgud5000 <[email protected]>
…and legacy overwrite behavior Signed-off-by: gitgud5000 <[email protected]>
db200f3 to
55b8666
Compare
… new save modes and credentials parameter Signed-off-by: gitgud5000 <[email protected]>
Signed-off-by: gitgud5000 <[email protected]>
Signed-off-by: gitgud5000 <[email protected]>
|
@ankatiyar Done ✅ I wasn’t able to resolve a couple of things:
=========================== short test summary info ============================
FAILED kedro-datasets/tests/ibis/test_table_dataset.py::TestTableDataset::test_connection_config[None-None-None-connection_config1-key1]
|
Signed-off-by: Ankita Katiyar <[email protected]>
63b95c1 to
4c6ddb9
Compare
Signed-off-by: Deepyaman Datta <[email protected]>
1057b14 to
513aab0
Compare
ibis.TableDataset add configurable save mode for table materialization and credentials supportSigned-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
62d976a to
55ba8a0
Compare
|
The code looks good to me! I'm wondering if we need to support |
|
yes, the next release will likely be a major one (given the new GenAI datasets), so doing the breaking change now could make sense. That said, I’m not sure how widely the Ibis dataset is used or whether existing users would expect backward compatibility / a deprecation phase before overwrite is removed. If usage is low, it might be fine to go straight to mode and make it part of the next major. |
I believe there are a fair number of users, and I'd rather keep the deprecation warning given that (1) it's already implemented and (2) it's not some complicated functionality to maintain, and there are no significant tradeoffs/downsides to having the functionality.
What is the timeline for the next release? @gitgud5000 also implemented credentials support as part of this PR that we were split out/are moving to a follow up. I assume that would also be good to get in for the next release. |
Signed-off-by: Deepyaman Datta <[email protected]>
586c24c to
6912ec6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @gitgud5000 and @deepyaman 💯
…de argument (PR kedro-org#1093) and credentials support in the upcoming changes section. Signed-off-by: gitgud5000 <[email protected]>
* Add credentials parameter to TableDataset for authentication Signed-off-by: gitgud5000 <[email protected]> * Add tests for TableDataset credentials functionality Signed-off-by: gitgud5000 <[email protected]> * test(ibis): suppress false positive secret warning Signed-off-by: gitgud5000 <[email protected]> * style(tests): apply lint and formatting fixes Signed-off-by: gitgud5000 <[email protected]> * fix(ibis): add credentials type validation to TableDataset. Appease mypy and surface explicit errors for invalid credential inputs. Signed-off-by: gitgud5000 <[email protected]> * fix(ibis): Fix formatting and lint issues Signed-off-by: gitgud5000 <[email protected]> * fix: secret false positive Signed-off-by: gitgud5000 <[email protected]> * fix: formatting Signed-off-by: gitgud5000 <[email protected]> * fix(tests): remove commented-out references from upstream examples Signed-off-by: gitgud5000 <[email protected]> * fix(RELEASE.md): update documentation for `ibis.TableDataset` with mode argument (PR #1093) and credentials support in the upcoming changes section. Signed-off-by: gitgud5000 <[email protected]> * fix: docs Signed-off-by: gitgud5000 <[email protected]> * fix: docs... Signed-off-by: gitgud5000 <[email protected]> * Address comments by @deepyaman: simplify `TableDataset` credentials merging; update docs and tests Signed-off-by: gitgud5000 <[email protected]> * fix credentials type annotation to resolves a mypy false positive. Signed-off-by: gitgud5000 <[email protected]> * feat(schema): add optional credentials object to catalog JSON schema Signed-off-by: gitgud5000 <[email protected]> * fix test determine backend from dataset config and add credentials-override case Signed-off-by: gitgud5000 <[email protected]> --------- Signed-off-by: gitgud5000 <[email protected]> Signed-off-by: Deepyaman Datta <[email protected]> Co-authored-by: Deepyaman Datta <[email protected]>
TODO(deepyaman): Move notes on
credentialssupport to a new PR from @gitgud5000 with those changes, once it exists. I'm not deleting the detailed notes from here until they're moved elsewhere.feat(datasets): ibis.TableDataset add
modeandcredentialssupportmodeparameter for save operations, allowing "append", "overwrite", "error", and "ignore" options to control write behavior.overwriteflag with backward compatibility and deprecation warning.modeand legacyoverwriteto avoid ambiguity.credentialsparameter to accept connection info (dict or string URI), supersedingconnection.credentialsand deprecatedconnectionare provided._describe()to use it.Semi-breaking change:
overwriteandconnectionare deprecated; users should migrate tomodeandcredentials.Description
This PR introduces two key enhancements to
ibis.TableDataset:Configurable Save Modes:
modeparameter tosave_argssimilar to Spark’sDataFrameWriter.modeand Pandas’to_csv(mode=...), with support for:"append": Insert data into an existing table (requires the backend to implementinsert())."overwrite": Drop and recreate the table/view."error"or"errorifexists": Fail if the table already exists."ignore": Do nothing if the table exists; otherwise, create it.overwrite=True|Falsemaps tomode="overwrite"|"error".modeandoverwriteare specified simultaneously.credentialsParameter:credentialsas the preferred method for specifying Ibis backend connection configurations.constring).connectionparameter._connect()and_describe()to support new functionality.Development notes
modeargument).DataFrameWriter.modeand Pandas’sto_csv(mode=…).overwritebehavior.ibis.connect()Expand for details:
Save Mode Handling
DEFAULT_SAVE_ARGSupdated to usemode="overwrite"by default.__init__:mode/overwritepresence.overwrite→modeinternally.save()dispatches based on mode:append→ callsinsert()if supported, else raisesNotImplementedError.overwrite,error,ignore→ handled viacreate_table/create_viewwith appropriate overwrite flag._describe()includesmode.Credentials Support
credentialsparam to init signature and docstring.constring._connect()routes accordingly based on type._get_backend_name()extracts backend info for_describe().All supported
modeoptions were tested using both DuckDB and Postgres backends.credentialsparsing was tested for string, dict-with-con, and dict-with-backend formats.Checklist
jsonschema/kedro-catalog-X.XX.jsonif necessaryRELEASE.mdfile