`datalake`: relax `translation_stm::max_collectible_offset()` value (and add `compaction_test.py`) #24610

WillemKauf · 2024-12-18T22:07:01Z

Previously, datalake::translation::translation_stm computed its max collectible offset as follows:

redpanda/src/v/datalake/translation/state_machine.cc

Lines 112 to 122 in 925707c

    
    model::offset translation_stm::max_collectible_offset() {
        if (!_raft->log_config().iceberg_enabled()) {
            return model::offset::max();
        }
        // if offset is not initialized, do not attempt translation.
        if (_highest_translated_offset == kafka::offset{}) {
            return model::offset{};
        }
        return _raft->log()->to_log_offset(
          kafka::offset_cast(_highest_translated_offset));
    }

This offset translation makes the max collectible offset overly restrictive, because it is unaware of translation batches.

Here, the utility function highest_log_offset_below_next() is added, which returns the "equivalent" translated log offset for a given kafka offset, taking into account translation batches (which don't need to be translated, and thus shouldn't restrict the max collectible offset).

translation_stm::max_collectible_offset() now uses this function to relax its returned offset.
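To illustrate the difference, here is a minimal toy model in Python. It is a sketch under stated assumptions, not the Redpanda implementation: the log is a plain list with one entry per log offset, every data batch holds exactly one Kafka record, and both function bodies are hypothetical re-implementations that only borrow the names used in this PR.

```python
from typing import List

DATA = "data"  # a batch holding one Kafka record (one record per batch, for simplicity)
CTRL = "ctrl"  # a non-data batch (e.g. configuration); it has no Kafka offset


def to_log_offset(log: List[str], kafka_offset: int) -> int:
    """Log offset at which the record with `kafka_offset` lives (toy model)."""
    seen = -1  # highest Kafka offset encountered so far
    for log_offset, kind in enumerate(log):
        if kind == DATA:
            seen += 1
            if seen == kafka_offset:
                return log_offset
    # Assumption of this sketch: Kafka offsets not yet produced map past the log end.
    return len(log) + (kafka_offset - seen - 1)


def highest_log_offset_below_next(log: List[str], kafka_offset: int) -> int:
    """Highest log offset strictly below the log offset of the next Kafka offset."""
    return to_log_offset(log, kafka_offset + 1) - 1


# Kafka offset 0 sits at log offset 0, followed by three non-data batches,
# and Kafka offset 1 sits at log offset 4.
log = [DATA, CTRL, CTRL, CTRL, DATA]
old_value = to_log_offset(log, 0)                  # plain offset translation
new_value = highest_log_offset_below_next(log, 0)  # batch-aware value
print(old_value, new_value)
```

In this model the plain translation stops at log offset 0, while the batch-aware value advances to log offset 3, so the three non-data batches no longer hold back the max collectible offset.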

Additionally, a new test for compaction with an Iceberg-enabled topic is added to datalake/compaction_test.py, with some enhancements to the datalake_verifier service to make it compaction-aware.

Backports Required

Release Notes

Improvements

Fixes an overly restrictive condition for retention in Iceberg-enabled topics.

WillemKauf · 2024-12-18T22:23:46Z

tests/rptest/tests/datalake/datalake_verifier.py

@@ -24,22 +24,24 @@

 class DatalakeVerifier():
    """
-     Verifier that does the verification of the data in the redpanda Iceberg table. 
-     The verifier consumes offsets from specified topic and verifies it the data 
+     Verifier that does the verification of the data in the redpanda Iceberg table.


Trailing whitespace removal

vbotbuildovich · 2024-12-19T00:42:10Z

Retry command for Build#59935

please wait until all jobs are finished before running the slash command



/ci-repeat 1
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_ChunkedRead == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[2,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/write_caching_fi_test.py::WriteCachingFailureInjectionTest.test_crash_all
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[2,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True)"}}
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all@{"use_transactions":false}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_topic_lifecycle@{"cloud_storage_type":1,"filesystem_catalog_mode":false}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_Timequery == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"path"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True)"}}
tests/rptest/tests/datalake/datalake_e2e_test.py::DatalakeE2ETests.test_topic_lifecycle@{"cloud_storage_type":1,"filesystem_catalog_mode":true}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, TS_Timequery == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"path"],"test_case":{"name":"(TS_Read == True, TS_TxRangeMaterialized == True, SpilloverManifestUploaded == True)"}}
tests/rptest/tests/tiered_storage_model_test.py::TieredStorageTest.test_tiered_storage@{"cloud_storage_type_and_url_style":[1,"virtual_host"],"test_case":{"name":"(TS_Read == True, AdjacentSegmentMergerReupload == True)"}}

vbotbuildovich · 2024-12-19T01:44:33Z

CI test results

test results on build#59935

test_id	test_kind	job_url	test_status	passed
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2	FAIL	0/6
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1	FAIL	0/6
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c806-4851-8e40-029c7bdf36d7	FAIL	0/6
rptest.tests.datalake.datalake_e2e_test.DatalakeE2ETests.test_topic_lifecycle.cloud_storage_type=CloudStorageType.S3.filesystem_catalog_mode=True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1	FAIL	0/6
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064	FLAKY	5/6
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c807-4275-8982-8723111a2347	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.ABS.2.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f6-49d5-ae5a-d5eb3497ad6e	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f2-467e-a057-0e4a790311ae	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c806-4851-8e40-029c7bdf36d7	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.AdjacentSegmentMergerReupload==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f6-49d5-ae5a-d5eb3497ad6e	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c807-4275-8982-8723111a2347	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_ChunkedRead==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f2-467e-a057-0e4a790311ae	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c806-4851-8e40-029c7bdf36d7	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_Timequery==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c807-4275-8982-8723111a2347	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064	FAIL	0/1
rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.virtual_host.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f6-49d5-ae5a-d5eb3497ad6e	FAIL	0/1
rptest.tests.write_caching_fi_e2e_test.WriteCachingFailureInjectionE2ETest.test_crash_all.use_transactions=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c804-449e-983b-f6040b48eed2	FAIL	0/1
rptest.tests.write_caching_fi_e2e_test.WriteCachingFailureInjectionE2ETest.test_crash_all.use_transactions=False	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f4-48f5-b848-e36477ea95e1	FAIL	0/1
rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_crash_all	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc18-c803-4c2f-9cd0-86f9a8dd5064	FAIL	0/1
rptest.tests.write_caching_fi_test.WriteCachingFailureInjectionTest.test_crash_all	ducktape	https://buildkite.com/redpanda/redpanda/builds/59935#0193dc1c-24f3-40db-9e3c-b72686edcfd1	FAIL	0/1

test results on build#60023

test_id	test_kind	job_url	test_status	passed
rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/60023#0193e62a-cfd5-4342-9d5c-bd82d737eba0	FLAKY	2/6

test results on build#60079

test_id	test_kind	job_url	test_status	passed
rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True.truncate_point=start_offset	ducktape	https://buildkite.com/redpanda/redpanda/builds/60079#0193f494-6494-4f55-ac0e-69cf1956752e	FLAKY	5/6

test results on build#60368

test_id	test_kind	job_url	test_status	passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60368#01944294-d5f0-4c98-8132-4655bda7db82	FLAKY	5/6
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60368#019442af-789d-4ffa-a231-a3a2c2539ba9	FLAKY	1/6
rptest.transactions.stream_verifier_test.StreamVerifierTest.test_simple_produce_consume_txn_with_add_node	ducktape	https://buildkite.com/redpanda/redpanda/builds/60368#01944294-d5ed-4718-a54f-7ac9bb15fdcc	FLAKY	5/6
storage_e2e_single_thread_rpunit.storage_e2e_single_thread_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60368#01944252-4124-4c76-85c2-f3165a0ba962	FLAKY	1/2

test results on build#60544

test_id	test_kind	job_url	test_status	passed
rm_stm_tests_rpunit.rm_stm_tests_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60544#01944d38-a71c-4bb4-86a4-b75b1322f1b0	FLAKY	1/2

test results on build#60563

test_id	test_kind	job_url	test_status	passed
rm_stm_tests_rpunit.rm_stm_tests_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60563#01944e3e-7668-4063-8ce8-c9bff6ddcdc7	FLAKY	1/2
rptest.tests.datalake.simple_connect_test.RedpandaConnectIcebergTest.test_translating_avro_serialized_records.cloud_storage_type=CloudStorageType.S3	ducktape	https://buildkite.com/redpanda/redpanda/builds/60563#01944e84-4faa-45d6-848a-06d1d9cef224	FLAKY	5/6
rptest.tests.partition_reassignments_test.PartitionReassignmentsTest.test_reassignments_kafka_cli	ducktape	https://buildkite.com/redpanda/redpanda/builds/60563#01944e84-4faa-45d6-848a-06d1d9cef224	FLAKY	1/6

test results on build#60597

test_id	test_kind	job_url	test_status	passed
rm_stm_tests_rpunit.rm_stm_tests_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60597#019450cc-9bd3-4e5f-afb0-e1ec69fbb762	FLAKY	1/2
rm_stm_tests_rpunit.rm_stm_tests_rpunit	unit	https://buildkite.com/redpanda/redpanda/builds/60597#019450cc-9cc5-4fda-8817-501bafa1c3fe	FLAKY	1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade	ducktape	https://buildkite.com/redpanda/redpanda/builds/60597#01945126-d4c5-4f13-bf17-1e1af69bd8cc	FLAKY	5/6

WillemKauf · 2024-12-19T04:28:05Z

A lot of KgoVerifierProducer failures: panic: Out of order offset 0 (vs 0 20000).

Not sure if this is another KgoVerifierProducer issue or if something else has been broken.

The only related change I can see in KgoVerifier was this, in which pw.validOffsets.Insert() is now called under a lock in the new function OnAcked (but CI must have run with this change many times before seeing these failures, so I am uncertain).

EDIT: Probably just because of the oneshot() changes I made. Reverted.

WillemKauf · 2024-12-20T14:33:25Z

Force push to:

Revert changes to kgo_verifier_service::oneshot().

vbotbuildovich · 2024-12-20T20:42:39Z

Retry command for Build#60016

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/datalake/compaction_test.py::CompactionTest.test_compaction@{"cloud_storage_type":1}

WillemKauf · 2024-12-20T21:10:07Z

Force push to:

Change compaction wait condition in compaction_test.py. Translation seems to slow the compaction process down quite a bit.

src/v/datalake/translation/tests/translated_log_offset_test.cc

WillemKauf · 2024-12-23T16:21:32Z

Force push to:

Add two new tests to translated_log_offset_test.cc
Remove early return in get_translated_log_offset() to correct behavior for the edge case of kafka::offset{}
Add comment to get_translated_log_offset() declaration about its use.

src/v/datalake/translation/utils.cc

bharathv

sorry for the delay here.

src/v/datalake/translation/state_machine.h

tests/rptest/tests/datalake/datalake_verifier.py

bharathv · 2025-01-06T21:45:45Z

tests/rptest/tests/datalake/datalake_verifier.py

@@ -24,22 +24,24 @@



Compaction is expected to block until translation happens. What additional coverage does verification with a compacted log with a fully translated iceberg table provide?

> Compaction is expected to block until translation happens

> What additional coverage does verification with a compacted log with a fully translated iceberg table provide?

That the aforementioned expectation is true (i.e. the Iceberg table is fully translated and the log is fully compacted).

Do you think there is other verification that should be added here?

> That the aforementioned expectation is true (i.e Iceberg table is fully translated, log is fully compacted).

Correct me if I'm wrong but the verifier in the current form can also succeed if the topic got translated from a compacted log (hypothetically if the code violated the max_collectible_offset invariant), no?

Yes, I suppose that is true.

In its current form, the verifier handles the cases where the Iceberg table has as much or more information than the log, because we assume the table was translated before compaction of the log took place (or that compaction did not take place at all). It does not verify that the case you described didn't occur.
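A minimal sketch of the kind of check being discussed, assuming records are plain (offset, key) tuples. The function `verify_compacted` and its inputs are hypothetical and are not the actual datalake_verifier code. It accepts a table that holds as much or more than the compacted log, and tolerates offset gaps.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (offset, key)


def verify_compacted(log_records: List[Record], table_rows: List[Record]) -> bool:
    """Return True if every record surviving in the log is present in the table."""
    table_offsets = [offset for offset, _ in table_rows]
    # Table offsets must be strictly increasing; gaps between them are allowed.
    if any(b <= a for a, b in zip(table_offsets, table_offsets[1:])):
        return False
    table = dict(table_rows)
    # Every record that survived compaction must appear at the same offset
    # with the same key. Extra table rows (older values of a key) are fine.
    return all(table.get(offset) == key for offset, key in log_records)


full = [(0, "a"), (1, "b"), (2, "a"), (3, "b")]  # log before compaction
compacted = [(2, "a"), (3, "b")]                  # latest record per key

translated_before_compaction = verify_compacted(compacted, full)
translated_after_compaction = verify_compacted(compacted, compacted)
print(translated_before_compaction, translated_after_compaction)
```

Both calls succeed in this sketch, which is the gap raised in this thread: a check of this shape cannot tell a table translated before compaction from one translated from an already compacted log.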

Thanks, so I was wondering if we should instead test enforcement of max_collectible_offset which is a more critical invariant and leave the verifier as it is today.

I agree that perhaps we haven't covered the case where the max_collectible_offset invariant is violated, but I do think the existing changes to the verifier are helpful to at least ensure that the case in which it ISN'T violated is handled correctly.

I'll try to think of ways to better cover the critical invariant in a follow up PR, if that works for you.

bharathv · 2025-01-06T21:54:25Z

tests/rptest/tests/datalake/compaction_test.py

@@ -118,3 +120,126 @@ def test_translation_no_gaps(self, cloud_storage_type):
                              include_query_engines=[QueryEngineType.TRINO
                                                     ]) as dl:
            self.do_test_no_gaps(dl)
+
+
+class CompactionTest(RedpandaTest):


One of the motivations for removing offset translation was to use Iceberg-enabled topics with read replicas/topic recovery. Wondering if it's worth adding an e2e test for it.

Would you like to see a test added in this PR, or as a follow up PR?

Follow up PR is also fine. (just want to ensure nothing else is broken, other than offset translation before we declare it as working).

FWIW read replicas or topic recovery don't appear to be addressed by this PR.

Read replica translators won't be able to perform offset translation on anything, and topic recovery will likely require changes to what revisions and topic overrides get passed to the coordinators.

I don't believe either is in scope of this work (which is really just to unblock compaction IIUC).

Good point about topic recovery needing more work. I think @~bashtanov is doing some work here in the migrations land.

> Read replica translators won't be able to perform offset translation on anything,

What's the use case here with read replicas, though? IIRC it's about being able to (SQL) query an Iceberg-enabled RRR topic, and that's it? The RRR topic itself cannot do any translation locally.

That's right. But as is, RRR topics won't be able to call the to_log_offset() that's done in the translation path. I think it's reasonable to have the translator skip the offset-translation and pass a nullopt to the STM as the log offset.

src/v/datalake/translation/partition_translator.cc

src/v/datalake/translation/tests/state_machine_test.cc

WillemKauf · 2025-01-07T15:42:19Z

Force push to:

Rebase to upstream/dev and fix merge conflicts
Rename new_log_translated_offset -> new_translated_log_offset in state_machine.cc/.h
Add comment in datalake_verifier.py around _expected_compacted_keys
Move changes for state_machine_test.cc to proper commit

WillemKauf · 2025-01-07T19:44:58Z

Force push to:

Rebase to upstream/dev and fix merge conflicts

bharathv · 2025-01-07T22:14:59Z

tests/rptest/tests/datalake/datalake_verifier.py

@@ -24,22 +24,24 @@



Thanks, so I was wondering if we should instead test enforcement of max_collectible_offset which is a more critical invariant and leave the verifier as it is today.

src/v/datalake/translation/partition_translator.cc

src/v/datalake/translation/state_machine.h

andrwng · 2025-01-07T21:58:29Z

src/v/datalake/translation/utils.h

+model::offset
+get_translated_log_offset(ss::shared_ptr<storage::log> log, kafka::offset o);


Given it may be confusing to refer to "translation" here in multiple contexts (I'm actually not sure if you mean it as offset-translated or datalake-translated), it may be more self-descriptive if this were named highest_log_offset_below(kafka::offset), where the translator would pass in kafka::next_offset(max_translated_offset)

Or consider the name highest_log_offset_below_next()

Being able to call this function in translation_stm::max_collectible_offset() like

return get_translated_log_offset(_raft->log(), _highest_translated_offset);

feels better than having to manipulate the passed offset outside the function before calling it each time.

If you feel strongly about this, I can change the name. It is unfortunate that "translation" can mean two different things in this context, but I hope the code comments are descriptive enough?

I feel strongly about the name: we already have so many named offsets here and there, and this one doesn't seem so pivotal that it needs a special name. My vote is for highest_log_offset_below_next()

andrwng · 2025-01-07T22:05:51Z

tests/rptest/tests/datalake/compaction_test.py

@@ -118,3 +120,126 @@ def test_translation_no_gaps(self, cloud_storage_type):
                              include_query_engines=[QueryEngineType.TRINO
                                                     ]) as dl:
            self.do_test_no_gaps(dl)
+
+
+class CompactionTest(RedpandaTest):


FWIW read replicas or topic recovery don't appear to be addressed by this PR.

Read replica translators won't be able to perform offset translation on anything, and topic recovery will likely require changes to what revisions and topic overrides get passed to the coordinators

I don't believe either are in scope of this work (which is really just to unblock compaction IIUC)

tests/rptest/tests/datalake/datalake_verifier.py

andrwng · 2025-01-07T22:11:49Z

src/v/datalake/translation/tests/translated_log_offset_test.cc

+bool check_translated_log_offset(
+  ss::shared_ptr<storage::log> log,
+  kafka::offset translated_offset,
+  model::offset expected_offset) {
+    auto translated_log_offset
+      = datalake::translation::get_translated_log_offset(
+        log, translated_offset);
+    return expected_offset == translated_log_offset;
+}


nit: this feels like quite a lot of testing for something that ultimately just calls into the offset translator, so it feels like we're just testing the offset translator. Wondering if we can instead write tests that check the max collectible offset? Plus, if you go the highest_log_offset_below() route, this also all changes

Yeah, ultimately this is more of an offset translation test, but I wanted to have a set of very illustrative unit tests that only tested this mechanism, and not the state machine as a whole.

src/v/datalake/translation/utils.h

src/v/datalake/translation/state_machine.cc

WillemKauf · 2025-01-09T21:31:24Z

Force push to:

Remove persistence of _highest_translated_log_offset in datalake/translation structures
Directly call into get_translated_log_offset() within translation_stm::max_collectible_offset()
Add check for RRR in translation_stm::max_collectible_offset()

andrwng · 2025-01-09T22:14:58Z

> Add check for RRR in translation_stm::max_collectible_offset()

Mind removing this from this PR and following up with a test in a separate PR? IMO this one here should be focused on unblocking compaction

WillemKauf · 2025-01-09T22:38:47Z

Force push to:

Rename get_translated_log_offset() -> highest_log_offset_below_next()
Remove commit Add check for RRR in translation_stm::max_collectible_offset()
Refactor code comment text width in datalake_verifier.py

And most importantly, add the function `highest_log_offset_below_next()`. This function will be used to compute the appropriate highest translated log offset for a given translated kafka offset while taking into account translator batches. This will allow us to be less pessimistic about the `max_collectible_offset` returned by the `translation_stm` in the future.

WillemKauf · 2025-01-10T13:59:47Z

Force push to:

Rebase to upstream/dev to fix linter CI issues

This will return a less restrictive value for `translation_stm::max_collectible_offset()`.

By handling gaps in offsets and recording seen keys, we can validate the correctness of a compacted log that has been translated (fully) into an iceberg table.

Adds a new `test_compaction` test, which uses the `KgoVerifierSeqConsumer` to validate a fully compacted log, along with the `datalake_verifier` service to validate the Iceberg table. Also moves the contents of `compaction_gaps_test.py` into `compaction_test.py`.

WillemKauf · 2025-01-10T15:12:23Z

Force push to:

Fix bazel build deps

bharathv

LGTM. Does the cover letter need to be updated to highest_log_offset_below_next?

dotnwat · 2025-01-14T02:09:21Z

src/v/datalake/translation/utils.h

+// Returns the equivalent log offset which can be considered translated by the
+// datalake subsystem, while taking into account translator batch types, for a
+// given kafka offset.
+//
+// Note that the provided kafka::offset o MUST be a valid offset, i.e one that
+// has been produced to the log. This function will always return a value, and
+// its correctness depends on the validity of the input offset.
+//
+// For example, in the following situation:
+// Kaf offsets: [O]   .   .    .     [NKO]
+// Log offsets: [K]  [C] [C] [C/TLO] [NKO]
+// where O is the input offset, K is the last kafka record, C is a translator
+// (Config) batch, TLO is the translated log offset, and NKO is the next
+// expected kafka record. We should expect TLO to be equal to the offset of the
+// last configuration batch before the next kafka record.
+model::offset highest_log_offset_below_next(
+  ss::shared_ptr<storage::log> log, kafka::offset o);


WillemKauf requested a review from andrwng December 18, 2024 22:07

github-actions bot added area/build area/redpanda labels Dec 18, 2024

WillemKauf force-pushed the datalake_translator_offset_fix branch from 5db3f18 to 6c113d7 Compare December 18, 2024 22:09

WillemKauf force-pushed the datalake_translator_offset_fix branch from 6c113d7 to 0e1a24c Compare December 20, 2024 14:32

WillemKauf force-pushed the datalake_translator_offset_fix branch from 0e1a24c to 2fe6c55 Compare December 20, 2024 15:05

WillemKauf force-pushed the datalake_translator_offset_fix branch from 2fe6c55 to b01095c Compare December 20, 2024 21:09

WillemKauf requested review from bharathv and mmaslankaprv December 21, 2024 00:45

mmaslankaprv reviewed Dec 23, 2024

src/v/datalake/translation/tests/translated_log_offset_test.cc

WillemKauf force-pushed the datalake_translator_offset_fix branch from b01095c to 4bd7693 Compare December 23, 2024 16:19

WillemKauf requested a review from mmaslankaprv January 2, 2025 17:07

bharathv reviewed Jan 4, 2025

src/v/datalake/translation/utils.cc (outdated)

bharathv reviewed Jan 6, 2025

WillemKauf force-pushed the datalake_translator_offset_fix branch from 4bd7693 to d1ef28c Compare January 7, 2025 15:40

WillemKauf force-pushed the datalake_translator_offset_fix branch from d1ef28c to 7d564a7 Compare January 7, 2025 19:44

bharathv previously approved these changes Jan 7, 2025

andrwng reviewed Jan 7, 2025

andrwng reviewed Jan 8, 2025

src/v/datalake/translation/state_machine.cc

WillemKauf dismissed bharathv’s stale review via af320da January 9, 2025 21:30

WillemKauf force-pushed the datalake_translator_offset_fix branch from 7d564a7 to af320da Compare January 9, 2025 21:30

WillemKauf changed the title ~~datalake: remove offset translation from translation_stm (and add compaction_test.py)~~ datalake: relax translation_stm::max_collectible_offset() value (and add compaction_test.py) Jan 9, 2025

WillemKauf force-pushed the datalake_translator_offset_fix branch from af320da to 5dc456c Compare January 9, 2025 22:37

WillemKauf force-pushed the datalake_translator_offset_fix branch from 5dc456c to 9db2f43 Compare January 10, 2025 13:59

WillemKauf added 3 commits January 10, 2025 10:11

datalake: use highest_log_offset_below_next() in translation_stm

f39b05f

This will return a less restrictive value for `translation_stm::max_collectible_offset()`.

rptest: make datalake_verifier compaction aware

5a65d4e

By handling gaps in offsets and recording seen keys, we can validate the correctness of a compacted log that has been translated (fully) into an iceberg table.

WillemKauf force-pushed the datalake_translator_offset_fix branch from 9db2f43 to 3862ad0 Compare January 10, 2025 15:12

WillemKauf requested review from andrwng and bharathv January 10, 2025 19:48

andrwng approved these changes Jan 10, 2025

bharathv approved these changes Jan 10, 2025

WillemKauf merged commit 205a168 into redpanda-data:dev Jan 13, 2025
19 checks passed

dotnwat reviewed Jan 14, 2025
