tx/group compaction fixes #24637
base: dev
Conversation
/dt
CI test results on builds #60037, #60049, #60058
/ci-repeat 3 (retry command for build #60049)
A bug in 24.2.0 resulted in a situation where tx_fence batches were retained _after_ compaction while their corresponding data/commit/abort batches were compacted away. This applied only to group transactions that used tx_fence to begin the transaction.

Historical context: tx_fence was historically the fence batch type that began both group transactions and regular data partition transactions. That changed starting 24.2.0, where a dedicated fence batch type (group_tx_fence) was used for group transaction fencing.

After this buggy compaction, these uncleaned tx_fence batches are accounted as open transactions when computing max_collectible_offset, thus blocking further compaction after an upgrade to 24.2.x.

We just ignore tx_fence batches going forward; the rationale is as follows.

- Firstly, they are not in use starting 24.2 (in favor of the dedicated group_tx_fence), so anyone starting group transactions from 24.2 shouldn't see any conflicts.
- For sealed transactions, commit/abort/data batches were already removed if the compaction ran, so ignoring tx_fence should be the right thing to do in such cases without any conflicts/correctness issues.
- Hypothetically, if the compaction didn't run, it is still ok to ignore those batches because in group transactions committed transactions are atomically rewritten as a separate raft_data batch along with the commit marker, which is applied in the stm (so no state is lost).
- Any group transaction using tx_fence likely belonged to 24.1.x, which is at least 6 months old at the time of writing, so it is reasonable to assume all such transactions are already sealed, especially since we started enforcing a max transaction timeout of 15 mins.
- The only case where it could theoretically be a problem is during an upgrade from 24.1.x with an open transaction: if the cluster upgrades to 24.2.x (with the fix) while the transaction remains open throughout the upgrade, the transaction would then be considered aborted (if leadership is assumed on a 24.2.x broker). This is a highly unlikely scenario, but the suggestion is to stop all running group transaction (kstreams) applications when doing the upgrade.

Note: this only affects group transactions, so regular transactions that do not do offset commits as part of a transaction are safe.
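To make the effect concrete, here is a toy model of the open-transaction accounting described above (Python, purely illustrative: the batch type strings, the function shape, and the exact max_collectible_offset arithmetic are simplifying assumptions, not Redpanda internals):

```python
# Toy model: how a stale tx_fence pins the collectible offset, and why
# ignoring tx_fence batches unblocks compaction. All names and semantics
# here are illustrative assumptions, not the actual Redpanda implementation.
def max_collectible_offset(batches, log_end_offset, ignore_tx_fence=True):
    open_begin = {}  # producer id -> begin offset of its open transaction
    for b in batches:
        if b["type"] == "tx_fence":
            if ignore_tx_fence:
                # The fix: legacy pre-24.2 group fences are skipped, so a fence
                # whose data/commit/abort batches were compacted away no longer
                # counts as an open transaction.
                continue
            open_begin.setdefault(b["pid"], b["offset"])
        elif b["type"] == "group_tx_fence":
            open_begin.setdefault(b["pid"], b["offset"])
        elif b["type"] in ("group_commit_tx", "group_abort_tx"):
            open_begin.pop(b["pid"], None)
    if not open_begin:
        return log_end_offset
    # Compaction cannot advance past the earliest open transaction.
    return min(open_begin.values()) - 1

# A log shaped like the buggy 24.2.0 compaction output: the tx_fence
# survived, but its corresponding data/commit batches did not.
log = [{"type": "tx_fence", "pid": 7, "offset": 100}]
print(max_collectible_offset(log, 5000, ignore_tx_fence=False))  # 99: compaction stuck
print(max_collectible_offset(log, 5000, ignore_tx_fence=True))   # 5000: unblocked
```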
This will result in hanging transactions and subsequent blocking of compaction.
.. for a given partition, to be hooked up with REST API in the next commit.
/v1/debug/producers/{namespace}/{topic}/{partition} .. includes low level debug information about producers for idempotency/transactional state.
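For illustration, hitting this endpoint from a script or test could look roughly like the sketch below (hypothetical helper: only the path comes from the commit message above; the default admin port and the response handling are assumptions):

```python
import requests

# Hypothetical helper; the endpoint path is taken from the commit message
# above, everything else (default admin port, response handling) is assumed.
def get_partition_producers(host, namespace, topic, partition, admin_port=9644):
    url = f"http://{host}:{admin_port}/v1/debug/producers/{namespace}/{topic}/{partition}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    # The body carries low level idempotency/transactional state for the
    # partition's producers (producer id/epoch, transaction info, etc.).
    return resp.json()

# e.g. get_partition_producers("localhost", "kafka", "my-topic", 0)
```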
@@ -185,6 +186,8 @@ class group_manager {
described_group describe_group(const model::ntp&, const kafka::group_id&);
partition_response describe_partition_producers(const model::ntp&);
nit: partition_response is confusing as it has no context, can we create an alias for this type, e.g. partition_producers?
}
response.error_code = kafka::error_code::none;
// snapshot the list of groups attached to this partition
fragmented_vector<std::pair<group_id, group_ptr>> groups;
chunked_vector ?
absl::btree_map<model::producer_identity, model::offset>
    producer_to_begin;
const all_txs_t& inflight_transactions() const { return _all_txs; }
This part seems not relevant to the current commit?
"[{}] group: {}, producer: {}, begin: {}", | ||
_raft->ntp(), |
nit: can we add more context to this log line? i.e. it is not clear what "begin" refers to.
if (pid_str.empty() || epoch_str.empty() || sequence_str.empty()) {
    throw ss::httpd::bad_param_exception(
            "invalid producer_id/epoch, should be >= 0");
nit: the error message is misaligned
auto& tx = state.transaction;
int64_t start_offset = -1;
if (tx && tx->begin_offset >= model::offset{0}) {
    start_offset = partition->get_offset_translator_state()
is it possible that tx->begin_offset doesn't exist anymore / was removed by retention?
Skimmed the PR to be up-to-date with the changes. Hope the observations are useful.
@@ -2974,4 +2974,23 @@ ss::future<tx::errc> tx_gateway_frontend::do_delete_partition_from_tx(
    co_return tx::errc::none;
}
ss::future<tx::errc> tx_gateway_frontend::unsafe_abort_group_transaction(
can you add a comment describing what is "unsafe" about this operation? what invariants will break? what semantic behavior breaks?
},
{
    "name": "sequence",
    "in": "int",
typo? should be "in": "query",
}
if (pid_str.empty() || epoch_str.empty() || sequence_str.empty()) {
    throw ss::httpd::bad_param_exception(
This should happen out of the box btw. This seems to be called implicitly when handling requests https://github.com/redpanda-data/seastar/blob/09a59a23ff2740a2fa591b0e65d978ca83d2b9e3/include/seastar/http/handlers.hh#L76
auto sequence_str = request->get_query_param("sequence");
if (group_id.empty()) {
    throw ss::httpd::bad_param_exception("group_id cannot be empty");
Can this actually happen, or is it guaranteed by the router that this is not empty? nit: anyway, the right error here would be 404.
@@ -1849,3 +1849,15 @@ def get_debug_bundle_file(self, filename: str, node: MaybeNode = None):
def delete_debug_bundle_file(self, filename: str, node: MaybeNode = None):
    path = f"debug/bundle/file/{filename}"
    return self._request("DELETE", path, node=node)

def unsafe_abort_group_transaction(self, group_id: str, pid: int,
nit: def unsafe_abort_group_transaction(self, group_id: str, *, pid: int,
to force the caller to specify the param names; too many ints that are easy to reorder.
dedicated fence batch type (group_tx_fence) was used for group
transaction fencing.

After this buggy compaction, these uncleaned tx_fence batches are
Is the buggy compaction fixed now? Otherwise, shouldn't we just fix the buggy compaction instead?
model::producer_identity{header.producer_id, header.producer_epoch},
header.base_offset);
model::record_batch_header, kafka::group_tx::offsets_metadata) {
// Transaction boundaries are determined by fence/commit or abort
This change doesn't seem to be explained by the commit message?
@@ -92,6 +92,46 @@
        ]
    }
]
},
{
"path": "/v1/transaction/{group_id}/unsafe_abort_group_transaction",
Nit: this looks a bit inconsistent, because group_id can be easily confused with transactional_id (when comparing with other endpoints)
cluster::get_producers_request{ntp, timeout});
if (result.error_code != cluster::tx::errc::none) {
    throw ss::httpd::server_error_exception(fmt::format(
        "Error {} processing partition state for ntp: {}",
nit: "processing partition state" sounds a bit ambiguous, maybe "getting producers for ntp:" instead?
const model::ntp ntp = parse_ntp_from_request(req->param);
auto timeout = std::chrono::duration_cast<model::timeout_clock::duration>(
    10s);
auto result = co_await _tx_gateway_frontend.local().get_producers(
should we check here if the frontend is initialized (as is done for the other endpoint)?
pid = model::producer_id{parsed_pid};
} catch (const boost::bad_lexical_cast& e) {
    throw ss::httpd::bad_param_exception(
        fmt::format("invalid producer_id, should be >= 0: {}", e));
nit: I guess printing the value would be more useful than printing the exception?
auto group_ntp = mapper.ntp_for(kafka::group_id{group_id});
if (!group_ntp) {
    throw ss::httpd::server_error_exception(
        "consumer_offsets topic now found");
typo now -> not
producers.producers.push(std::move(producer_state));
co_await ss::coroutine::maybe_yield();
}
co_return ss::json::json_return_type(std::move(producers));
ss::json::stream_range_as_array?
return _inflight_requests;
}
const request_queue& fnished_requests() const { return _finished_requests; }
typo: fnished -> finished
Check individual commit messages for details.
Main changes
Backports Required
Release Notes
Bug Fixes