Merge update initial implementation #2684
base: vasil.pashov/feature/merge
Conversation
Force-pushed efa409d to 33f3a23
Add prune_previous, metadata, on and match_on_index to the skeleton. Implement checks for features that are not yet implemented. Initial implementation. Set types properly. Fix tests. Fix unit tests. Enable more tests.
Force-pushed 33f3a23 to 2c55289
```python
    metadata: Optional[Any] = None,
    upsert: bool = False,
) -> VersionedItem:
    udm, item, norm_meta = self._nvs._try_normalize(
```
- docs :)
```python
def merge(
    self,
    symbol: str,
    dataframe: NormalizableType,
```
Rename the parameter to `data` to match the update method?
| ) | ||
|
|
||
|
|
||
| class MergeStrategy(NamedTuple): |
What are the expected semantics of UPDATE vs INSERT? Just thinking of how this will be used, isn't upsert always what's wanted i.e. add missing, or update what's there?
I guess upsert will be the most commonly used. Update and insert are more or less orthogonal and can be implemented separately, so we can present the user with an option to do just one of the two operations.
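A toy sketch of the semantics being discussed, using plain dicts keyed by index value (the `MergeAction` flag and `apply` helper are hypothetical illustrations, not the PR's API): UPDATE and INSERT are orthogonal, and upsert is simply their combination.

```python
from enum import Flag


class MergeAction(Flag):
    # Hypothetical flags: UPDATE and INSERT are orthogonal operations,
    # and UPSERT is just both at once.
    UPDATE = 1  # only overwrite rows whose keys already exist in the target
    INSERT = 2  # only add rows whose keys are missing from the target
    UPSERT = UPDATE | INSERT


def apply(target: dict, source: dict, action: MergeAction) -> dict:
    """Toy model of merging `source` into `target` under `action`."""
    result = dict(target)
    for key, value in source.items():
        exists = key in result
        if exists and MergeAction.UPDATE in action:
            result[key] = value  # update an existing row
        elif not exists and MergeAction.INSERT in action:
            result[key] = value  # insert a missing row
    return result
```

With this framing, exposing the two flags separately costs nothing, since upsert is just their union.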
```python
    "mode must be one of StagedDataFinalizeMethod.WRITE, StagedDataFinalizeMethod.APPEND, 'write', 'append'"
)


def merge(
```
Should we add another method for this, versus making it a feature of update? Update already has an upsert parameter, which many would expect to have upsert semantics.
Yeah, I find the semantics of the upsert parameter a bit confusing; I guess the word got overloaded too much. Technically it's possible to add some flags and make this functionality part of update, but the two operations do quite different things and the implementations will be vastly different, so I'm not convinced that mashing them into one API call would be an improvement.
```cpp
};


std::vector<SliceAffectedByMerge> slices_affected_by_merge(
    const InputTensorFrame& source, std::span<const SliceAndKey> slices
```
This will need the `match_on_timeseries_index` parameter as well, as we will always need to read everything when not matching on the index.
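A Python sketch of the affected-slice computation being reviewed, under the assumption (stated in the PR description) that both indexes are sorted and each target slice covers a half-open time range. The names `SliceAffected` and `slice_ranges` are illustrative; the real function works on `InputTensorFrame` and `SliceAndKey`. It produces the pair described later in the PR text: the index of the affected slice and the first source row that falls into it.

```python
import bisect
from typing import NamedTuple


class SliceAffected(NamedTuple):
    slice_index: int       # position of the slice in the index key
    first_source_row: int  # first source row whose index value falls in the slice


def slices_affected_by_merge(source_index, slice_ranges):
    """For each target slice [start, end), binary-search the sorted source
    index for the first row that lands inside it. Slices with no matching
    source rows are skipped entirely (they need not be read when matching
    on the index)."""
    affected = []
    for i, (start, end) in enumerate(slice_ranges):
        row = bisect.bisect_left(source_index, start)  # first row >= start
        if row < len(source_index) and source_index[row] < end:
            affected.append(SliceAffected(i, row))
    return affected
```

Note this pruning only works when matching on the index; with `match_on_timeseries_index=False` every slice is potentially affected, as the comment above points out.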
```cpp
std::vector<folly::Future<folly::Unit>> merge_segments_fut;
merge_segments_fut.reserve(affected_slices.size());
for (const SliceAffectedByMerge& affected : affected_slices) {
    merge_segments_fut.emplace_back(
```
This needs to be windowed to guarantee we don't read everything into memory at once
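A minimal Python analog of the windowing being requested (in the C++ code this would typically be something like `folly::window`): cap the number of in-flight slice merges so that only `window_size` segments are materialised at once. `process_windowed` and its parameters are hypothetical names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def process_windowed(items, work, window_size=4):
    """Process `items` in windows of at most `window_size` concurrent tasks.
    Each window is fully drained before the next one starts, bounding peak
    memory to `window_size` in-flight segments."""
    results = []
    with ThreadPoolExecutor(max_workers=window_size) as pool:
        for start in range(0, len(items), window_size):
            batch = items[start:start + window_size]
            futures = [pool.submit(work, item) for item in batch]
            results.extend(f.result() for f in futures)  # drain the window
    return results
```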
```cpp
.thenValue([store,
            &update_info](std::pair<VariantKey, SegmentInMemory>&& key_segment) {
    const AtomKey& key = std::get<AtomKey>(key_segment.first);
    return store->write(
```
What about slicing? We can be inserting lots of rows, so could end up with a massive segment on disk
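The re-slicing the comment asks for can be sketched in a few lines (hypothetical helper, with an assumed per-segment row limit): instead of writing one merged segment however large it grows, split the output rows back into bounded segments before writing.

```python
def split_into_segments(rows, max_rows_per_segment=100_000):
    """Split merged output rows into chunks no larger than the row-slicing
    limit, so no single segment written to disk becomes massive."""
    return [
        rows[i:i + max_rows_per_segment]
        for i in range(0, len(rows), max_rows_per_segment)
    ]
```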
```cpp
);
})
.thenValueInline([index, affected](VariantKey&& key) {
    index->slice_and_keys[affected.slice_index].set_key(
```
The row count will now be wrong in the SliceAndKey?
See the comment on slicing above though; I don't think this design of modifying the existing index will work in that case. Better to just build up a new index key, more like update does.
```cpp
const auto target_index_end = target_index.end<IndexType>();
while (row_in_source < source.num_rows) {
    const timestamp source_index_value = source_index[row_in_source];
    if (slice_to_update.key().end_time() <= source_index_value) {
```
Using atom key end times is error-prone due to various bugs we've had in the past; use the segment in memory as the source of truth whenever it has already been read.
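The suggestion above can be sketched as a tiny helper (hypothetical name; the exclusive-end convention is an assumption about how end times are stored): when the segment is already in memory, derive the end time from its last index value instead of trusting the key.

```python
def slice_end_time(key_end_time, segment_index=None):
    """Prefer the in-memory segment's index over the atom key's stored
    end_time. Assumes end times are exclusive, hence the +1."""
    if segment_index:  # segment already read: use it as the source of truth
        return segment_index[-1] + 1
    return key_end_time  # fall back to the key only when nothing is in memory
```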
```cpp
}


auto target_row_it = std::lower_bound(target_index_search_start, target_index_end, source_index_value);
while (target_row_it != target_index_end && *target_row_it == source_index_value) {
```
Let's discuss this algorithm on a call; there are pros and cons to whatever we choose to do for the index.
We should also design this with match_on_index_column=false and !on.empty() in mind now, as they could both cause large changes to how this looks.
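For reference, the index-matching loop under discussion can be sketched in Python, using `bisect.bisect_left` as the analog of `std::lower_bound`. Because both indexes are sorted, each search resumes from the previous match rather than restarting from the front. The function name and tuple output are illustrative; the output mirrors the (UPDATE/INSERT, row_in_target_segment, row_in_source) pairs mentioned in the PR description's next steps.

```python
import bisect


def classify_source_rows(source_index, target_index):
    """Classify each source row as an UPDATE (its index value exists in the
    sorted target index) or an INSERT (it does not), tracking where in the
    target it belongs. Both inputs must be sorted."""
    result = []
    search_start = 0
    for source_row, value in enumerate(source_index):
        # lower_bound: first target row whose index value is >= value
        target_row = bisect.bisect_left(target_index, value, search_start)
        if target_row < len(target_index) and target_index[target_row] == value:
            result.append(("UPDATE", target_row, source_row))
        else:
            result.append(("INSERT", target_row, source_row))
        search_start = target_row  # sortedness lets us resume the search here
    return result
```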
Reference Issues/PRs
Monday:
What does this implement or fix?
This provides the initial implementation of the merge functionality, supporting only the update part of it. It supports only matching on an ordered DatetimeIndex with static schema.
The algorithm takes advantage of the fact that both the source and the target are ordered.
O(index_row_count * log(source_row_count)). The information is stored as a pair: the index of the affected slice in the index key, and the first index value from the source that falls into that slice.

Next steps:
The iteration in step 3 above is row-wise. This will be slow for DataFrames containing UTF string values, as reading UTF strings requires holding the GIL, and in general row-wise iteration is not cache friendly. This initial implementation uses row-wise iteration because it's easier to implement. Column-wise iteration would need to either perform O(slice_column_count * source_row_count * log(segment_size)) work or use a caching mechanism matching source rows to segment rows. Another difficulty will be related to the on clause: with an on clause we need to check the entire row (across all segments) to know whether an update should be performed. The long-term plan is to add an additional step before update_segment_inplace that will iterate over all slices and generate a list of pairs (UPDATE/INSERT, row_in_target_segment, row_in_source).

Any other comments?
Checklist
Checklist for code changes...