
Conversation

@vasil-pashov
Collaborator

Reference Issues/PRs

Monday:

What does this implement or fix?

This provides an initial implementation of the merge functionality, supporting only the update part of it. Matching is supported only on an ordered DatetimeIndex with a static schema.

The algorithm takes advantage of the fact that both the source and the target are ordered.

  1. Iterate over all slices in the index key and produce a list of objects describing which slices can contain rows from the source. This is done by performing lower_bound (binary search) in the source index for the start index value stored in the slice. If the returned value is between key_start_index and key_end_index, the data segment could be affected. The complexity is O(index_row_count * log(source_row_count)). The information is stored as a pair: the index of the affected slice in the index key and the first index value from the source that falls into that slice (see the sketch after this list).
  2. Only the potentially affected data keys are read.
  3. For each data key (in parallel), iterate over all index values in the source that fall between the first and last index values of the data key and perform lower_bound (binary search) to check whether the index value from the source is in the segment. If it is, perform the update. Complexity: O(source_row_count * log(segment_size)).
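
A minimal sketch of step 1 (hypothetical stand-in types, not the PR's actual code), assuming the source index is a sorted array of timestamps and each slice carries the half-open [key_start_index, key_end_index) range of its data segment:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <span>
#include <vector>

using timestamp = std::int64_t;

// Hypothetical stand-ins for the PR's slice and result types.
struct Slice { timestamp key_start_index; timestamp key_end_index; };
struct AffectedSlice { std::size_t slice_index; std::size_t first_source_row; };

// For each slice, binary search the sorted source index for the slice's start
// value; the slice is affected only if some source value lands inside
// [key_start_index, key_end_index). Stores the row position (rather than the
// value) of the first matching source entry.
std::vector<AffectedSlice> affected_slices(std::span<const timestamp> source_index, std::span<const Slice> slices) {
    std::vector<AffectedSlice> result;
    for (std::size_t i = 0; i < slices.size(); ++i) {
        const auto it = std::lower_bound(source_index.begin(), source_index.end(), slices[i].key_start_index);
        if (it != source_index.end() && *it < slices[i].key_end_index) {
            result.push_back({i, static_cast<std::size_t>(it - source_index.begin())});
        }
    }
    return result;
}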

Next steps:
The iteration in step 3 above is row-wise. This will be slow for DataFrames containing UTF string values, as reading UTF strings requires holding the GIL, and in general row-wise iteration is not cache friendly. This initial implementation uses row-wise iteration because it is easier to implement. Column-wise iteration would need to either perform O(slice_column_count * source_row_count * log(segment_size)) work or use a caching mechanism matching each source row to a segment row. Another difficulty relates to the on clause: with an on clause we need to check the entire row (across all segments) to know whether an update should be performed. The long-term plan is to add an additional step before update_segment_inplace that iterates over all slices and generates a list of pairs (UPDATE/INSERT, row_in_target_segment, row_in_source); a sketch of such a plan follows.
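
As a hedged sketch of that planned pre-pass (the names are hypothetical, not the PR's API), the output could be a flat list of actions, one per source row that hits a slice; column-wise application then walks the plan once per column, touching each segment column contiguously instead of re-searching per row:

#include <cstddef>
#include <vector>

// Whether the source row updates an existing target row or inserts a new one.
enum class MergeOp { UPDATE, INSERT };

// One planned action; row_in_target_segment is only meaningful for UPDATE.
struct MergeAction {
    MergeOp op;
    std::size_t row_in_target_segment;
    std::size_t row_in_source;
};

using MergePlan = std::vector<MergeAction>;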

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@vasil-pashov vasil-pashov force-pushed the vasil.pashov/implement-merge-update branch from efa409d to 33f3a23 Compare October 1, 2025 13:35
Add prune_previous, metadata, on and match_on_index to the skeleton. Implement checks for features that are not yet implemented

Initial implementation

Set types properly

Fix tests

Fix unit tests

Enable more tests
@vasil-pashov vasil-pashov force-pushed the vasil.pashov/implement-merge-update branch from 33f3a23 to 2c55289 Compare October 1, 2025 14:17
metadata: Optional[Any] = None,
upsert: bool = False,
) -> VersionedItem:
udm, item, norm_meta = self._nvs._try_normalize(


  • docs :)

def merge(
self,
symbol: str,
dataframe: NormalizableType,


Should this parameter be named `data`, to match the update method?

)


class MergeStrategy(NamedTuple):


What are the expected semantics of UPDATE vs INSERT? Just thinking of how this will be used, isn't upsert always what's wanted i.e. add missing, or update what's there?

Collaborator Author


I guess upsert will be the most commonly used. Update and insert are more or less orthogonal and can be implemented separately, so we can also present the user with the option to do just one of the two operations.

"mode must be one of StagedDataFinalizeMethod.WRITE, StagedDataFinalizeMethod.APPEND, 'write', 'append'"
)

def merge(


Should we have another method for this, vs making it a feature of update?
Update already has an upsert parameter, which many would assume has upsert semantics.

Collaborator Author


Yeah, I find the semantics of the upsert parameter a bit confusing; I guess the word got overloaded too much. Technically it's possible to add some flags and make this functionality part of update, but the two are doing quite different things and the implementations will be vastly different, so I'm not convinced it would be better to mash them into one API call.

};

std::vector<SliceAffectedByMerge> slices_affected_by_merge(
const InputTensorFrame& source, std::span<const SliceAndKey> slices
Collaborator


This will need the `match_on_timeseries_index` parameter as well, as we will always need to read everything when not matching on the index.
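
A possible shape for that (a sketch against the types declared above, not a confirmed signature):

std::vector<SliceAffectedByMerge> slices_affected_by_merge(
        const InputTensorFrame& source,
        std::span<const SliceAndKey> slices,
        bool match_on_timeseries_index // when false, every slice must be read
);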

std::vector<folly::Future<folly::Unit>> merge_segments_fut;
merge_segments_fut.reserve(affected_slices.size());
for (const SliceAffectedByMerge& affected : affected_slices) {
merge_segments_fut.emplace_back(
Collaborator


This needs to be windowed to guarantee we don't read everything into memory at once
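
folly::window is one way to bound this (a sketch with hypothetical stand-ins, not the PR's code): at most window_size per-slice merges are in flight, so only that many data segments are held in memory at once.

#include <folly/futures/Future.h>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the PR's slice descriptor and per-slice work
// (read the data segment, apply the update, write it back).
struct SliceAffectedByMerge { std::size_t slice_index; std::size_t first_source_row; };

folly::Future<folly::Unit> merge_one_slice(SliceAffectedByMerge affected) {
    // Placeholder body: the real work would read, update and rewrite the segment.
    (void)affected;
    return folly::makeFuture();
}

std::vector<folly::Future<folly::Unit>> merge_windowed(std::vector<SliceAffectedByMerge> affected_slices, std::size_t window_size) {
    // folly::window starts at most window_size futures at a time, launching
    // the next one as each in-flight future completes.
    return folly::window(
            std::move(affected_slices),
            [](SliceAffectedByMerge affected) { return merge_one_slice(std::move(affected)); },
            window_size);
}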

.thenValue([store,
&update_info](std::pair<VariantKey, SegmentInMemory>&& key_segment) {
const AtomKey& key = std::get<AtomKey>(key_segment.first);
return store->write(
Collaborator


What about slicing? We can be inserting lots of rows, so could end up with a massive segment on disk
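
One hedged option (hypothetical helper, not the PR's code): re-slice the merged rows into row-count-bounded chunks before writing, so no single written segment exceeds the configured slice size.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical half-open row range within the merged segment.
struct RowRange { std::size_t start; std::size_t end; };

// Split total_rows into chunks of at most max_rows each; each chunk would be
// written out as its own data segment.
std::vector<RowRange> split_into_slices(std::size_t total_rows, std::size_t max_rows) {
    std::vector<RowRange> ranges;
    for (std::size_t start = 0; start < total_rows; start += max_rows) {
        ranges.push_back({start, std::min(start + max_rows, total_rows)});
    }
    return ranges;
}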

);
})
.thenValueInline([index, affected](VariantKey&& key) {
index->slice_and_keys[affected.slice_index].set_key(
Collaborator


The row count will now be wrong in the SliceAndKey?
See the above comment on slicing, though: I don't think this design of modifying the existing index will work in that case; better to just build up a new index key, more like update does.

const auto target_index_end = target_index.end<IndexType>();
while (row_in_source < source.num_rows) {
const timestamp source_index_value = source_index[row_in_source];
if (slice_to_update.key().end_time() <= source_index_value) {
Collaborator


Using atom key end times is error-prone due to various bugs we've had in the past; use the segment in memory as the source of truth whenever it has already been read.
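
A hedged illustration of that rule (hypothetical accessor, not the real segment API): fall back to the key's end time only when the segment's index column is not available in memory.

#include <cstdint>
#include <span>

using timestamp = std::int64_t;

// index_column is a hypothetical view over the in-memory segment's index
// values; prefer its last value (exclusive end = last value + 1) over the
// atom key's recorded end time once the segment has been read.
timestamp slice_end_time(std::span<const timestamp> index_column, timestamp key_end_time) {
    return index_column.empty() ? key_end_time : index_column.back() + 1;
}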

}

auto target_row_it = std::lower_bound(target_index_search_start, target_index_end, source_index_value);
while (target_row_it != target_index_end && *target_row_it == source_index_value) {
Collaborator


Let's discuss this algorithm on a call. There are pros and cons to whatever we choose to do for the index.
We should also design this with match_on_index_column=false and !on.empty() in mind now, as they could both cause large changes to how this looks.

@vasil-pashov vasil-pashov marked this pull request as draft October 6, 2025 13:34