Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datalake/coordinator: initial coordinator base #23712

Merged
merged 7 commits into from
Oct 10, 2024

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Oct 9, 2024

Based on #23714

Adds some initial structure for the coordinator. This commit just adds
wrapper around a couple of calls to the underlying STM, so that it can
be plugged into the frontend.

Later commits will have the coordinator reconcile the STM with the
Iceberg catalog, perform local/Raft snapshots, etc.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@andrwng andrwng force-pushed the datalake-coordinator-base branch from eb54b88 to 2115f9d Compare October 9, 2024 22:26
@andrwng andrwng marked this pull request as ready for review October 9, 2024 22:26
@andrwng andrwng force-pushed the datalake-coordinator-base branch from 2115f9d to 7637549 Compare October 9, 2024 22:37
Pulls some utilities out of a test to make it easier to reuse building
state updates.
For use by the upcoming datalake coordinator STM to track files that
need to be committed to Iceberg.
Adds topics_state to the STM, and updates the apply method to
appropriately update it.
@andrwng andrwng requested review from mmaslankaprv, bharathv and dotnwat and removed request for mmaslankaprv October 10, 2024 05:40
@andrwng andrwng force-pushed the datalake-coordinator-base branch from 7637549 to 7ae61d7 Compare October 10, 2024 16:20
@andrwng andrwng requested a review from mmaslankaprv October 10, 2024 16:24
}
return update;
}

bool add_files_update::can_apply(const topics_state& state) {
checked<std::nullopt_t, stm_update_error>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be be useful to define a "no error state" in stm_update_error

enum class stm_update_error {
    none
    failed,
};

that I think removes the need for check<nullopt_t...>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been preferring to use errors for only error types, and use result/checked to signal expected vs not expected outcomes. I kind of like how error handling is uniform now with has_error(), instead of having to guess about optionality or special 0 codes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok, fwiw https://en.cppreference.com/w/cpp/error/error_code/operator_bool also defines using a none equivalent with 0 value, so thats somewhat of a standard approach I guess, anyway I'm not too strong about it.

Copy link
Contributor Author

@andrwng andrwng Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, IMO C++'s "best practices" for error handling are a bit error prone, unless we standardize our codebase to One True Way, it's way too easy to mess up..

src/v/datalake/coordinator/state_machine.cc Show resolved Hide resolved
src/v/datalake/coordinator/state_machine.cc Outdated Show resolved Hide resolved
src/v/datalake/coordinator/state_machine.cc Outdated Show resolved Hide resolved
src/v/datalake/coordinator/state_machine.cc Show resolved Hide resolved
src/v/datalake/coordinator/coordinator.cc Outdated Show resolved Hide resolved
"Rejecting request to add files for {}: {}",
tp,
update_res.error());
co_return errc::stm_apply_error;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this is a validation error? (since its pre replication).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the caller's perspective they should be the same thing, no? the request does not apply cleanly to the stm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking from a debugging perspective, if the request was rejected due to a race or if was legit validation, nbd, feel free to keep it as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving it for now, i think the log is enough, unless these get serialized in the response and the only logs we have are the worker logs

@andrwng andrwng force-pushed the datalake-coordinator-base branch 2 times, most recently from 46340ac to 89e151f Compare October 10, 2024 17:28
In some upcoming changes, we'll expect the STM to be an implementation
detail, and for callers to go through a separate coordinator
abstraction. To that end, this moves most STM methods out of the public
interface and exposes some internals for use in the upcoming
coordinator.
Adds some initial structure for the coordinator. This commit just adds
wrapper around a couple of calls to the underlying STM, so that it can
be plugged into the frontend.

Later commits will have the coordinator reconcile the STM with the
Iceberg catalog, perform local/Raft snapshots, etc.
@andrwng andrwng force-pushed the datalake-coordinator-base branch from 89e151f to 9ca5fac Compare October 10, 2024 17:29
@andrwng andrwng requested a review from bharathv October 10, 2024 17:31
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Oct 10, 2024

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Oct 10, 2024

Retry command for Build#56296

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_storage_timing_stress_test.py::CloudStorageTimingStressTest.test_cloud_storage@{"cleanup_policy":"delete"}
tests/rptest/tests/cloud_storage_timing_stress_test.py::CloudStorageTimingStressTest.test_cloud_storage_with_partition_moves@{"cleanup_policy":"delete"}

}

checked<ss::gate::holder, coordinator::errc> coordinator::maybe_gate() {
ss::gate::holder h;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h is unused? intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! unintentional, may fix in a follow up if this PR can get merged

@dotnwat dotnwat merged commit a599c84 into redpanda-data:dev Oct 10, 2024
14 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants