andrwng
Contributor

@andrwng andrwng commented Sep 24, 2025

  • throttle a log line for Glue
  • trace log payloads on error in the rest_client

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Improvements

  • Redpanda will now log trace-level messages about failed requests to the Iceberg REST catalog.
  • Redpanda will now log less frequently to warn about using the default Iceberg partition spec with AWS Glue.

@Copilot Copilot AI review requested due to automatic review settings September 24, 2025 19:22
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR improves logging in the Iceberg REST client for better debugging and operational visibility. The changes add trace-level logging for failed REST requests and throttle frequent warning messages about default partition specs with AWS Glue.

  • Added trace-level logging with request payloads when REST client operations fail
  • Implemented rate limiting for AWS Glue default partition spec warnings to reduce log noise
  • Enhanced error diagnostics by logging request details on failure
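The throttling idea in the overview can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this PR: the class name `log_throttle`, its `should_log` method, and the 60-second interval in the usage note are all hypothetical, and the real change may rely on logging helpers that already exist in the codebase.

```cpp
#include <chrono>
#include <optional>

// Hypothetical helper (not a Redpanda API): allows at most one log line
// per interval. The caller passes in the current time so the behavior is
// easy to test without sleeping.
class log_throttle {
public:
    using clock = std::chrono::steady_clock;

    explicit log_throttle(clock::duration interval)
      : _interval(interval) {}

    // Returns true if a log line may be emitted at `now`, and records
    // `now` as the last emission time in that case.
    bool should_log(clock::time_point now) {
        if (_last.has_value() && now - *_last < _interval) {
            return false; // still inside the quiet period
        }
        _last = now;
        return true;
    }

private:
    clock::duration _interval;
    std::optional<clock::time_point> _last;
};
```

A caller would construct one throttle per warning, for example `log_throttle glue_warning(std::chrono::seconds(60));`, and wrap the warning in `if (glue_warning.should_log(log_throttle::clock::now())) { ... }`.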

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/v/iceberg/rest_client/catalog_client.cc | Added trace logging for failed requests with payloads and improved error handling structure |
| src/v/iceberg/rest_client/BUILD | Added dependency on iobuf_parser for payload logging functionality |
| src/v/datalake/coordinator/coordinator.cc | Implemented rate limiting for AWS Glue partition spec warning messages |

Comment on lines 56 to 66
template<typename T>
void maybe_log_payload_as_json(
  ss::logger& l, ss::log_level lvl, std::string_view msg, const T& payload) {
    if (!l.is_enabled(lvl)) {
        return;
    }
    auto buf = serialize_payload_as_json(payload);
    iobuf_parser p(std::move(buf));
    vlogl(iceberg::log, lvl, "{}: {}", msg, p.read_string_safe(4_KiB));
}

Copilot AI Sep 24, 2025

This function should have a docstring explaining its purpose, parameters, and the 4_KiB limit rationale. It's unclear why 4_KiB was chosen as the safe read limit for payload logging.

    }
    auto buf = serialize_payload_as_json(payload);
    iobuf_parser p(std::move(buf));
    vlogl(iceberg::log, lvl, "{}: {}", msg, p.read_string_safe(4_KiB));

Copilot AI Sep 24, 2025

The 4_KiB limit should be defined as a named constant instead of a magic number. Consider defining it as constexpr auto max_payload_log_size = 4_KiB; at the top of the file.
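A rough standalone sketch of that suggestion, assuming plain `std::string_view` payloads instead of the `iobuf_parser` used in the PR. The constant name comes from the comment above; `truncate_for_log` is a hypothetical helper, and 4096 bytes stands in for `4_KiB`.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Name taken from the review suggestion; 4096 bytes stands in for 4_KiB.
constexpr std::size_t max_payload_log_size = 4096;

/// Hypothetical helper: returns at most `limit` bytes of `payload`,
/// so that a large request body cannot flood the log.
std::string truncate_for_log(
  std::string_view payload, std::size_t limit = max_payload_log_size) {
    // substr clamps the count to the remaining length, so short
    // payloads are returned unchanged.
    return std::string(payload.substr(0, limit));
}
```

Naming the limit keeps the truncation policy in one place if more call sites start logging payloads.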

@andrwng andrwng requested review from wdberkeley, dotnwat and nvartolomei and removed request for wdberkeley September 24, 2025 19:23
wdberkeley
wdberkeley previously approved these changes Sep 24, 2025
dotnwat
dotnwat previously approved these changes Sep 24, 2025
@dotnwat dotnwat enabled auto-merge September 24, 2025 21:12
@andrwng andrwng dismissed stale reviews from dotnwat and wdberkeley via 138231a September 24, 2025 21:42
@andrwng
Contributor Author

andrwng commented Sep 24, 2025

Force-pushed a fix for a non-functional bug (s/iceberg::logger/l)

dotnwat
dotnwat previously approved these changes Sep 24, 2025
wdberkeley
wdberkeley previously approved these changes Sep 24, 2025
@vbotbuildovich
Collaborator

vbotbuildovich commented Sep 24, 2025

Retry command for Build#72892

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/throttling_test.py::DatalakeThrottlingTest.test_backlog_metric@{"catalog_type":"nessie","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_iceberg_partition_key_file_location@{"catalog_type":"nessie","cloud_storage_type":1,"custom_partition_spec":null}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark"}
tests/rptest/tests/datalake/3rdparty_maintenance_test.py::Datalake3rdPartyMaintenanceTest.test_e2e_basic@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_avro_schema@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_e2e_basic@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_iceberg_partition_key_file_location@{"catalog_type":"nessie","cloud_storage_type":1,"custom_partition_spec":"(timestamp_us)"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_json_schema_unicode@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino"}
tests/rptest/tests/datalake/datalake_dlq_test.py::DatalakeDLQMultinodeTest.test_dlq_table_with_multiple_nodes@{"catalog_type":"nessie","cloud_storage_type":1}
tests/rptest/tests/datalake/cluster_restore_test.py::DatalakeClusterRestoreTest.test_slow_tiered_storage_dupe_records@{"catalog_type":"nessie","cloud_storage_type":1}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"illegal promotion int->string"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark"}
tests/rptest/tests/datalake/custom_partitioning_test.py::DatalakeCustomPartitioningTest.test_spec_evolution@{"catalog_type":"nessie","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_dlq_test.py::DatalakeDLQTest.test_dlq_table_for_invalid_records@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_e2e_basic@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_json_schema_unicode@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"illegal promotion int->string"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/3rdparty_maintenance_test.py::Datalake3rdPartyMaintenanceTest.test_e2e_basic@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/throttling_test.py::DatalakeThrottlingTest.test_basic_throttling@{"catalog_type":"nessie","cloud_storage_type":1}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_avro_schema@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"spark"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"reorder_columns"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino"}
tests/rptest/tests/datalake/compaction_test.py::CompactionGapsTest.test_translation_no_gaps@{"catalog_type":"nessie","cloud_storage_type":1}
tests/rptest/tests/datalake/cluster_restore_test.py::DatalakeClusterRestoreTest.test_slow_tiered_storage_dlq@{"catalog_type":"nessie","cloud_storage_type":1}
tests/rptest/tests/datalake/delayed_translation_test.py::DatalakeDelayedTranslationTest.test_basic@{"catalog_type":"nessie","cloud_storage_type":1,"query_engine":"trino"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_illegal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"trino","test_case":"illegal promotion int->string"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"spark","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_legal_schema_evolution@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"trino","test_case":"promote_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto2","query_engine":"spark","test_case":"drop_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_old_schema_writer@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"proto3","query_engine":"trino","test_case":"add_column"}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_reorder_columns@{"catalog_type":"nessie","cloud_storage_type":1,"produce_mode":"avro","query_engine":"spark"}
tests/rptest/tests/datalake/compaction_test.py::CompactionTest.test_compaction@{"catalog_type":"nessie","cloud_storage_type":1}

When there is an error returned from the REST catalog, we have very
little context about what we sent. This commit addresses this by
logging the logical contents of the request as JSON.

I considered logging at a lower level, and logging the entire HTTP
request payload (e.g. in perform_request() after signing and such), but
opted to place the logging at the callsite because the context we have there is
already const and available to be logged conditionally (vs in
perform_request(), the request is moved, making conditional logging
trickier).
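
The tradeoff described above can be illustrated with a small standalone sketch. Every name here is a hypothetical stand-in: `http_request`, the two-argument `perform_request`, `create_table`, and the `log_sink` vector are not the real Redpanda types or signatures. The point is only that the caller still holds a const view of the payload after the request object has been moved away, so it can log on failure.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a signed HTTP request.
struct http_request {
    std::string body;
};

// Stand-in for perform_request(): it takes ownership of the request, so
// the body is no longer available to the caller afterwards.
bool perform_request(http_request req, bool simulate_failure) {
    http_request consumed = std::move(req);
    (void)consumed; // a real client would send this over the wire
    return !simulate_failure;
}

// Callsite: `payload_json` stays const and intact, so it can be logged
// only when the request fails and only when trace logging is enabled.
bool create_table(
  const std::string& payload_json,
  bool trace_enabled,
  bool simulate_failure,
  std::vector<std::string>& log_sink) {
    http_request req{payload_json};
    const bool ok = perform_request(std::move(req), simulate_failure);
    if (!ok && trace_enabled) {
        log_sink.push_back("request failed, payload: " + payload_json);
    }
    return ok;
}
```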
@andrwng andrwng dismissed stale reviews from wdberkeley and dotnwat via e315237 September 25, 2025 02:04
@vbotbuildovich
Collaborator

CI test results

test results on build#72910
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ClusterRateQuotaTest | test_client_group_consume_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72910#01997ebc-1705-454d-8319-0f33fe1e4d91 | FLAKY | 16/21 | upstream reliability is '94.62254395036194'. current run reliability is '76.19047619047619'. drift is 18.43207 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_group_consume_rate_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72910#01997ebe-dfe3-414c-bfc4-375b95676b41 | FLAKY | 14/21 | upstream reliability is '82.2841726618705'. current run reliability is '66.66666666666666'. drift is 15.61751 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_group_consume_rate_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_and_produce_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72910#01997ebc-1708-4400-86a5-65c4d4e1fee8 | FLAKY | 16/21 | upstream reliability is '80.43087971274686'. current run reliability is '76.19047619047619'. drift is 4.2404 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_and_produce_throttle_mechanism |
| ClusterRateQuotaTest | test_client_response_throttle_mechanism | null | integration | https://buildkite.com/redpanda/redpanda/builds/72910#01997ebc-1709-4612-b367-b2b5d6e48726 | FLAKY | 12/21 | upstream reliability is '81.93493150684932'. current run reliability is '57.14285714285714'. drift is 24.79207 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ClusterRateQuotaTest&test_method=test_client_response_throttle_mechanism |
| DatalakeCustomPartitioningTest | test_many_partitions | {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/72910#01997ebc-1705-454d-8319-0f33fe1e4d91 | FLAKY | 12/21 | upstream reliability is '100.0'. current run reliability is '57.14285714285714'. drift is 42.85714 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeCustomPartitioningTest&test_method=test_many_partitions |
| RecoveryModeTest | test_rolling_restart | null | integration | https://buildkite.com/redpanda/redpanda/builds/72910#01997ebc-1705-454d-8319-0f33fe1e4d91 | FLAKY | 15/21 | upstream reliability is '94.51327433628319'. current run reliability is '71.42857142857143'. drift is 23.0847 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RecoveryModeTest&test_method=test_rolling_restart |

@dotnwat dotnwat merged commit d02cda6 into redpanda-data:dev Sep 25, 2025
17 checks passed
@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27715-v25.1.x-769 remotes/upstream/v25.1.x
git cherry-pick -x a493a7fc7c e315237b74

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27715-v25.2.x-965 remotes/upstream/v25.2.x
git cherry-pick -x a493a7fc7c e315237b74

Workflow run logs.
