
[SUPPORT] Streaming batch duration increased 10x after increasing number of commits retained #12030

Open
sergiomartinswhg opened this issue Sep 30, 2024 · 3 comments
Labels: performance, spark-streaming (spark structured streaming related)

Comments

sergiomartinswhg commented Sep 30, 2024

Hi all 👋
I'm using Spark Structured Streaming to stream from one Hudi table to another Hudi table.
I noticed that when the stream first started, each batch was relatively fast, with an average duration of 30 seconds, but it increased over time and stabilized at around 300 seconds.
I tested many configurations and concluded that this only happens when I have hoodie.cleaner.hours.retained=72 (3 days). When I reverted to the default (10 commits retained), batch latency returned to 30 seconds... and after increasing it again, latency grew over the following 3 days and stabilized at 300 seconds once more.
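For reference, here is roughly how the retention setting is passed to the writer (a minimal sketch with hypothetical table and field names; note that hoodie.cleaner.hours.retained is only honored when the cleaner policy is KEEP_LATEST_BY_HOURS):

    # Minimal sketch: writer options carrying the retention setting under test.
    # Table and field names are hypothetical placeholders.
    retention_test_options = {
        "hoodie.table.name": "target_table",
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",  # required for hours-based retention
        "hoodie.cleaner.hours.retained": "72",            # the setting that triggers the slowdown
    }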

I tried using async clean/archive with these lock/concurrency settings:

hoodie.write.concurrency.mode=optimistic_concurrency_control
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider

It improved things a bit, but batches still take 200-250 seconds with 72h of commits retained.
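For completeness, enabling async cleaning/archival alongside those lock settings would look roughly like this (a sketch; I'm assuming hoodie.archive.async is the flag for async archival):

    # Sketch: async table services plus the lock settings above.
    async_service_options = {
        "hoodie.clean.async": "true",    # move cleaning off the write path
        "hoodie.archive.async": "true",  # assumption: async archival flag
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.write.lock.provider":
            "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
    }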

In the Spark UI, SQL/DataFrame tab, a normal batch looks like the screenshot below:

[screenshot: Spark UI, SQL/DataFrame tab, one micro-batch]

In this example the batch took 3.3 min, but if I open all 19 related succeeded jobs and sum their durations, they add up to only ~30 seconds. Apparently the rest of the time is spent in non-Spark operations, but I found nothing in the logs during these gaps (this cluster runs only this stream, so it is not waiting for resources). The job that took the longest (26 s) has the description "Preparing compaction metadata".

Regarding my .hoodie/ folder, here are some stats (a snippet to gather them follows below):

  • the whole Hudi table folder is 28 GB with 30k objects
  • the .hoodie folder alone is 11 GB with 29k objects!
  • around 5k files sit directly in .hoodie/ (filenames contain rollback, clean, commit, etc.)
  • 1,347 files / 129 MB in .hoodie/archived/
  • most of the data is in .hoodie/metadata/, as per the screenshot:

[screenshot: size breakdown of .hoodie/metadata/]
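In case it helps others reproduce these numbers, a sketch of how they can be gathered (bucket and prefix are hypothetical):

    import boto3

    # Sketch: count objects and total size under .hoodie/ (hypothetical bucket/prefix).
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    count, size = 0, 0
    for page in paginator.paginate(Bucket="bucket", Prefix="warehouse/table/.hoodie/"):
        for obj in page.get("Contents", []):
            count += 1
            size += obj["Size"]
    print(f"{count} objects, {size / 1e9:.1f} GB")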

Steps to reproduce the behavior:

Set hoodie.cleaner.hours.retained=72; batch duration increases roughly 10x.

Expected behavior

Batch duration should not increase significantly.

Environment Description

  • Hudi 0.15, Spark 3.5.1, AWS S3; running on Kubernetes
  • Average input rate of 50 records/second
  • Table is COW, partitioned by day
  • Metadata table enabled, with column_stats and record index (see the config sketch below)
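For reproducibility, a minimal sketch of a Hudi-to-Hudi structured-streaming pipeline matching this environment (paths, table name, and field names are hypothetical; the record-index options follow the Hudi 0.14+ config names as I understand them):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-to-hudi").getOrCreate()

    # Streaming read from the upstream Hudi table (hypothetical S3 paths).
    src = spark.readStream.format("hudi").load("s3://bucket/source_table")

    target_options = {
        "hoodie.table.name": "target_table",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "id",       # hypothetical
        "hoodie.datasource.write.precombine.field": "ts",      # hypothetical
        "hoodie.datasource.write.partitionpath.field": "day",  # partitioned by day
        # Metadata table with column stats and record index (0.14+ config names).
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.index.column.stats.enable": "true",
        "hoodie.metadata.record.index.enable": "true",
        "hoodie.index.type": "RECORD_INDEX",
        # Retention under test.
        "hoodie.cleaner.policy": "KEEP_LATEST_BY_HOURS",
        "hoodie.cleaner.hours.retained": "72",
    }

    (src.writeStream.format("hudi")
        .options(**target_options)
        .option("checkpointLocation", "s3://bucket/checkpoints/target_table")
        .outputMode("append")
        .start("s3://bucket/target_table"))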
rangareddy commented Oct 1, 2024

Hi @sergiomartinswhg

When you increase the number of commits retained in Hudi, it can lead to a significant increase in the streaming batch duration. This is because Hudi needs to maintain a larger commit history, which can cause the following issues:

  • Increased metadata overhead: With more commits retained, Hudi needs to store and manage more metadata, which can lead to increased overhead in terms of storage, memory, and processing time.
  • Slower commit processing: When there are more commits to process, Hudi's commit processing time increases, leading to slower streaming batch durations.
  • Higher memory usage: Retaining more commits requires more memory to store the commit history, which can lead to increased memory usage and potential memory issues.

The main reason for the slowness is that Hudi needs to load the corresponding number of commits to perform index lookups, a process essential for handling updates.

Could you please try with the default settings:

  --hoodie-conf hoodie.cleaner.hours.retained=24
  --hoodie-conf hoodie.cleaner.parallelism=200
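Since the issue uses Spark Structured Streaming rather than DeltaStreamer-style --hoodie-conf flags, the same settings would be passed as writer options, along these lines (sketch):

    # The same settings expressed as structured-streaming writer options (sketch).
    cleaner_options = {
        "hoodie.cleaner.hours.retained": "24",
        "hoodie.cleaner.parallelism": "200",
    }
    # e.g. df.writeStream.format("hudi").options(**cleaner_options)...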

sergiomartinswhg (Author) commented

Hi @rangareddy,
I'm keeping 72h of commits in order to have a 3-day window of changes, to give downstream streams time to catch up in case of failure, similar to having 7 days of retention in Kafka. Streaming from Hudi relies on the Incremental Query feature, which in turn uses the retained commits for that (see the sketch at the end of this comment).

So if I reduce it to 24h and a downstream consumer stops for more than 24h, once I restart it, won't the downstream abort or, worse, lose data? Am I thinking about this correctly?
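To make the dependency concrete, a downstream incremental read looks roughly like this (the path and begin instant are hypothetical, and `spark` is an existing SparkSession); once the commits in that window have been cleaned or archived, the incremental read can no longer serve them:

    # Sketch: downstream incremental read that depends on the retained commits.
    incremental_df = (
        spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        # Hypothetical last-processed instant checkpointed by the consumer:
        .option("hoodie.datasource.read.begin.instanttime", "20240927000000")
        .load("s3://bucket/target_table")
    )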

rangareddy commented

Hi @sergiomartinswhg

Is it possible to share the Spark logs (Spark application logs and Spark event logs)?

ad1happy2go added the performance and spark-streaming labels on Oct 1, 2024