
[SUPPORT] Streaming batch duration increased 10x after increasing number of commits retained #12030

Open
sergiomartinswhg opened this issue Sep 30, 2024 · 3 comments
Labels: performance, spark-streaming (spark structured streaming related)

Comments

sergiomartinswhg commented Sep 30, 2024

Hi all 👋
I'm using Spark Structured Streaming to stream from one Hudi table to another Hudi table.
I noticed that when the stream first started, each batch was relatively fast, with an average duration of 30 seconds, but it increased over time and stabilized at around 300 seconds.
I tested many configurations and concluded that this only happens when I have hoodie.cleaner.hours.retained=72 (3 days). When I reverted to the default (10 commits retained), batch latency returned to 30 seconds... and after increasing it again, latency grew over the following 3 days and stabilized at 300 seconds once more.
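For reference, here is roughly how the retention setting is passed to the writer (a minimal sketch with hypothetical table and field names; note that hoodie.cleaner.hours.retained is only honored when the cleaner policy is KEEP_LATEST_BY_HOURS):

    # Minimal sketch: writer options carrying the retention setting under test.
    # Table and field names are hypothetical placeholders.
    retention_test_options = {
        "hoodie.table.name": "target_table",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",  # required for hours-based retention
        "hoodie.cleaner.hours.retained": "72",            # the setting that triggers the slowdown
    }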

I tried using async clean/archive with these lock/concurrency settings:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider

It improved things a bit, but batches still take 200-250 seconds with 72h of commits retained.
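For completeness, enabling async cleaning/archival alongside those lock settings would look roughly like this (a sketch; I'm assuming hoodie.archive.async is the flag for async archival):

    # Sketch: async table services plus the lock settings above.
    async_service_options = {
        "hoodie.clean.async": "true",    # move cleaning off the write path
        "hoodie.archive.async": "true",  # assumption: async archival flag
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    }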

In the Spark UI, SQL/DataFrame tab, a normal batch looks like the screenshot below:

[screenshot: Spark UI, SQL/DataFrame tab, one micro-batch]

In this example the batch took 3.3 min, but if I open all 19 related succeeded jobs and sum their durations, they add up to only ~30 seconds. Apparently the rest of the time is spent in non-Spark operations, but I found nothing in the logs during these gaps (this cluster runs only this stream, so it is not waiting for resources). The job that took the longest (26 s) has the description "Preparing compaction metadata".

Regarding my .hoodie/ folder, here are some stats (a snippet to gather them follows below):

  • the whole Hudi table folder is 28 GB with 30k objects
  • the .hoodie folder alone is 11 GB with 29k objects!
  • around 5k files sit directly in .hoodie/ (filenames contain rollback, clean, commit, etc.)
  • 1,347 files / 129 MB in .hoodie/archived/
  • most of the data is in .hoodie/metadata/, as per the screenshot:

[screenshot: size breakdown of .hoodie/metadata/]
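In case it helps others reproduce these numbers, a sketch of how they can be gathered (bucket and prefix are hypothetical):

    import boto3

    # Sketch: count objects and total size under .hoodie/ (hypothetical bucket/prefix).
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    count, size = 0, 0
    for page in paginator.paginate(Bucket="bucket", Prefix="warehouse/table/.hoodie/"):
        for obj in page.get("Contents", []):
            count += 1
            size += obj["Size"]
    print(f"{count} objects, {size / 1e9:.1f} GB")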

Steps to reproduce the behavior:

Set hoodie.cleaner.hours.retained=72; batch duration increases roughly 10x.

Expected behavior

Batch duration should not increase significantly.

Environment Description

  • Hudi 0.15, Spark 3.5.1, AWS S3; running on Kubernetes
  • Average input rate of 50 records/second
  • Table is COW, partitioned by day
  • Metadata table enabled, with column_stats and record index (see the config sketch below)
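For reproducibility, a minimal sketch of a Hudi-to-Hudi structured-streaming pipeline matching this environment (paths, table name, and field names are hypothetical; the record-index options follow the Hudi 0.14+ config names as I understand them):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-to-hudi").getOrCreate()

    # Streaming read from the upstream Hudi table (hypothetical S3 paths).
    src = spark.readStream.format("hudi").load("s3://bucket/source_table")

    target_options = {
        "hoodie.table.name": "target_table",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "id",       # hypothetical
        "hoodie.datasource.write.precombine.field": "ts",      # hypothetical
        "hoodie.datasource.write.partitionpath.field": "day",  # partitioned by day
        # Metadata table with column stats and record index (0.14+ config names).
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.index.column.stats.enable": "true",
        "hoodie.metadata.record.index.enable": "true",
        "hoodie.index.type": "RECORD_INDEX",
        # Retention under test.
        "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
        "hoodie.cleaner.hours.retained": "72",
    }

    (src.writeStream.format("hudi")
        .options(**target_options)
        .option("checkpointLocation", "s3://bucket/checkpoints/target_table")
        .outputMode("append")
        .start("s3://bucket/target_table"))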
rangareddy commented Oct 1, 2024

Hi @sergiomartinswhg

When you increase the number of commits retained in Hudi, it can lead to a significant increase in the streaming batch duration. This is because Hudi needs to maintain a larger commit history, which can cause the following issues:

  • Increased metadata overhead: With more commits retained, Hudi needs to store and manage more metadata, which can lead to increased overhead in terms of storage, memory, and processing time.
  • Slower commit processing: When there are more commits to process, Hudi's commit processing time increases, leading to slower streaming batch durations.
  • Higher memory usage: Retaining more commits requires more memory to store the commit history, which can lead to increased memory usage and potential memory issues.

The main reason for the slowness is that Hudi needs to load the corresponding number of commits to perform index lookups, a process essential for handling updates.

Could you please try with the default settings:

  --hoodie-conf hoodie.cleaner.hours.retained=24
  --hoodie-conf hoodie.cleaner.parallelism=200
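Since the issue uses Spark Structured Streaming rather than DeltaStreamer-style --hoodie-conf flags, the same settings would be passed as writer options, along these lines (sketch):

    # The same settings expressed as structured-streaming writer options (sketch).
    cleaner_options = {
        "hoodie.cleaner.hours.retained": "24",
        "hoodie.cleaner.parallelism": "200",
    }
    # e.g. df.writeStream.format("hudi").options(**cleaner_options)...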

sergiomartinswhg (Author) commented

Hi @rangareddy,
I'm keeping 72h of commits in order to have a 3-day window of changes, to give downstream streams time to catch up in case of failure, similar to having 7 days of retention in Kafka. Streaming from Hudi relies on the Incremental Query feature, which in turn uses the retained commits for that (see the sketch at the end of this comment).

So if I reduce it to 24h and a downstream consumer stops for more than 24h, once I restart it, won't the downstream abort or, worse, lose data? Am I thinking about this correctly?
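To make the dependency concrete, a downstream incremental read looks roughly like this (the path and begin instant are hypothetical, and `spark` is an existing SparkSession); once the commits in that window have been cleaned or archived, the incremental read can no longer serve them:

    # Sketch: downstream incremental read that depends on the retained commits.
    incremental_df = (
        spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        # Hypothetical last-processed instant checkpointed by the consumer:
        .option("hoodie.datasource.read.begin.instanttime", "20240927000000")
        .load("s3://bucket/target_table")
    )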

rangareddy commented

Hi @sergiomartinswhg

Is it possible to share the Spark logs (Spark application logs and Spark event logs)?

ad1happy2go added the performance and spark-streaming labels on Oct 1, 2024